Chapter 1: Getting started with R Language


Section 1.1: Installing R 


You might wish to install RStudio after you have installed R. RStudio is a development environment for R that 
simplifies many programming tasks. 


Windows only: 


Visual Studio (starting from version 2015 Update 3) now features a development environment for R called R Tools, 
that includes a live interpreter, Intellisense, and a debugging module. If you choose this method, you won't have to 
install R as specified in the following section. 


For Windows 


1. Go to the CRAN website, click on download R for Windows, and download the latest version of R. 
2. Right-click the installer file and RUN as administrator. 

3. Select the operational language for installation. 

4. Follow the instructions for installation. 


For OSX / macOS 
Alternative 1 


(0. Ensure XQuartz is installed ) 


1. Go to the CRAN website and download the latest version of R. 
2. Open the disk image and run the installer. 
3. Follow the instructions for installation. 


This will install both R and the R-MacGUI. It will put the GUI in the /Applications/ folder as R.app, where it can either
be double-clicked or dragged to the Dock. When a new version is released, the (re)-installation process will overwrite
R.app, but prior major versions of R will be maintained. The actual R code will be in the
/Library/Frameworks/R.framework/Versions/ directory. Using R within RStudio is also possible and would use
the same R code with a different GUI.


Alternative 2 


1. Install homebrew (the missing package manager for macOS) by following the instructions on https://brew.sh/ 
2. brew install R 


Those choosing the second method should be aware that the maintainer of the Mac fork advises against it, and will 
not respond to questions about difficulties on the R-SIG-Mac Mailing List. 


For Debian, Ubuntu and derivatives 


You can get the version of R corresponding to your distro via apt-get. However, this version will frequently be quite 
far behind the most recent version available on CRAN. You can add CRAN to your list of recognized "sources". 


sudo apt-get install r-base 


You can get a more recent version directly from CRAN by adding CRAN to your sources list. Follow the directions
from CRAN for more details. Note in particular that you also need to run the following command so that you can use
install.packages(), because packages on Linux are usually distributed as source files and need compilation:


sudo apt-get install r-base-dev


For Red Hat and Fedora


sudo dnf install R


For Archlinux 
R is directly available in the Extra package repo. 
sudo pacman -S r 


More info on using R under Archlinux can be found on the ArchWiki R page. 


Section 1.2: Hello World! 


"Hello World!" 


Also, check out the detailed discussion of how, when, whether and why to print a string. 


Section 1.3: Getting Help 


You can use the function help() or ? to access documentation and search for help in R. For even more general
searches, you can use help.search() or ??.


#For help on the help function of R
help()


#For help on the paste function
help(paste) #OR
help("paste") #OR
?paste #OR
?"paste"


Visit https://www.r-project.org/help.html for additional information 


Section 1.4: Interactive mode and R scripts 

The interactive mode 

The most basic way to use R is the interactive mode. You type commands and immediately get the result from R. 
Using R as a calculator 


Start R by typing R at the command prompt of your operating system or by executing RGui on Windows. Below you 
can see a screenshot of an interactive R session on Linux: 


Goalkicker.com - R Notes for Professionals a 


user:~$ R


R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)


R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.


R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.


Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.


> 1+1
[1] 2
>


This is RGui on Windows, the most basic working environment for R under Windows: 
RGui (64-bit)
File Edit View Misc Packages Windows Help


R version 3.4.0 Patched (2017-05-25 r72746) -- "You Stupid Darkness"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)


R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.


Natural language support but running in an English locale


R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.


Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.


> 1+1
[1] 2
>


After the > sign, expressions can be typed in. Once an expression is typed, the result is shown by R. In the
screenshot above, R is used as a calculator: Type


1+1


to immediately see the result, 2. The leading [1] indicates that R returns a vector. In this case, the vector contains
only one number (2).
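

The point that a single number is a vector of length one can be checked directly. The following is a minimal sketch; the variable name result is only illustrative and not part of the original example:

```r
# Even a single number is stored as a vector of length one.
result <- 1 + 1
is.vector(result)   # TRUE: the value is a vector
length(result)      # number of elements the vector holds
result[1]           # the element that the leading [1] refers to
```

Indexing with [1] works on the result just as it would on a longer vector.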


The first plot 


R can be used to generate plots. The following example uses the data set PlantGrowth, which comes as an example
data set along with R.


Type all of the following lines that do not start with ## into the R prompt. Lines starting with ## are meant to
document the result which R will return.


data(PlantGrowth)
str(PlantGrowth)
## 'data.frame': 30 obs. of 2 variables:
##  $ weight: num 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
##  $ group : Factor w/ 3 levels "ctrl","trt1","trt2": 1 1 1 1 1 1 1 1 1 1 ...
anova(lm(weight ~ group, data = PlantGrowth))
## Analysis of Variance Table
##
## Response: weight
##           Df  Sum Sq Mean Sq F value  Pr(>F)
## group      2  3.7663  1.8832  4.8461 0.01591 *
## Residuals 27 10.4921  0.3886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(weight ~ group, data = PlantGrowth, ylab = "Dry weight")


The following plot is created: 


[Figure: box plot of dry weight (y axis labelled "Dry weight") for the groups ctrl, trt1 and trt2]


data(PlantGrowth) loads the example data set PlantGrowth, which records the dry masses of plants that were
subject to two different treatment conditions or no treatment at all (control group). The data set is made available
under the name PlantGrowth. Such a name is also called a variable.


To load your own data, the following two documentation pages might be helpful: 


• Reading and writing tabular data in plain-text files (CSV, TSV, etc.)
• I/O for foreign tables (Excel, SAS, SPSS, Stata)


str(PlantGrowth) shows information about the data set which was loaded. The output indicates that PlantGrowth
is a data.frame, which is R's name for a table. The data.frame consists of two columns and 30 rows. In this case,
each row corresponds to one plant. Details of the two columns are shown in the lines starting with $: The first
column is called weight and contains numbers (num, the dry weight of the respective plant). The second column,
group, contains the treatment that the plant was subjected to. This is categorical data, which is called a factor in R.
Read more information about data frames.
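

The structure reported by str() can also be queried piece by piece. The sketch below uses only base R functions; the variable names (n_plants, column_names, weights, treatments) are illustrative:

```r
data(PlantGrowth)
n_plants <- nrow(PlantGrowth)             # one row per plant
column_names <- names(PlantGrowth)        # names of the two columns
weights <- PlantGrowth$weight             # the numeric column
treatments <- levels(PlantGrowth$group)   # categories of the factor column
is.numeric(weights)                       # TRUE for the weight column
is.factor(PlantGrowth$group)              # TRUE for the group column
```

The $ operator used here is the same one that str() shows in front of each column name.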


To compare the dry masses of the three different groups, a one-way ANOVA is performed using anova(lm( ... )).
weight ~ group means "Compare the values of the column weight, grouping by the values of the column group".
This is called a formula in R. data = ... specifies the name of the table where the data can be found.


The result shows, among other things, that there is a significant difference (column Pr(>F), p = 0.01591) between
some of the three groups. Post-hoc tests, like Tukey's test, must be performed to determine which groups' means
differ significantly.
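

As a hedged illustration of such a post-hoc test, base R provides TukeyHSD(), which works on a model fitted with aov(). This is a sketch of one possible follow-up analysis, not part of the original session; the names fit and tukey are illustrative:

```r
data(PlantGrowth)
fit <- aov(weight ~ group, data = PlantGrowth)   # same model, fitted with aov()
tukey <- TukeyHSD(fit)                           # all pairwise group comparisons
tukey$group   # differences, confidence bounds and adjusted p-values per pair
```

Each row of tukey$group corresponds to one pair of groups, so three groups give three comparisons.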


boxplot(...) creates a box plot of the data. weight ~ group means: "Plot the values of the column weight versus
the values of the column group", and data = ... specifies where the values to be plotted come from. ylab = ...
specifies the label of the y axis. More information: Base plotting


Type q() or press Ctrl+D to exit from the R session.


R scripts 


To document your research, it is favourable to save the commands you use for calculation in a file. For that effect, 
you can create R scripts. An R script is a simple text file, containing R commands. 


Create a text file with the name plant.R, and fill it with the following text, where some commands are familiar
from the code block above:


data(PlantGrowth)
anova(lm(weight ~ group, data = PlantGrowth))
png("plant_boxplot.png", width = 400, height = 300)
boxplot(weight ~ group, data = PlantGrowth, ylab = "Dry weight")
dev.off()


Execute the script by typing into your terminal (The terminal of your operating system, not an interactive R session 
like in the previous section!) 


R --no-save <plant.R >plant_result.txt 


The file plant_result.txt contains the results of your calculation, as if you had typed them into the interactive R 
prompt. Thereby, your calculations are documented. 


The new commands png and dev.off are used for saving the boxplot to disk. The two commands must enclose the
plotting command, as shown in the example above. png("FILENAME", width = ..., height = ...) opens a new
PNG file with the specified file name, width and height in pixels. dev.off() finishes plotting and saves the plot to
disk. No output is saved until dev.off() is called.
Chapter 2: Variables


Section 2.1: Variables, data structures and basic Operations 


In R, data objects are manipulated using named data structures. The names of the objects might be called
"variables", although that term does not have a specific meaning in the official R documentation. R names are case
sensitive and may contain alphanumeric characters (a-z, A-Z, 0-9), the dot/period (.) and underscore (_). To create
names for the data structures, we have to follow these rules:


• Names that start with a digit or an underscore (e.g. 1a), or names that are valid numerical expressions (e.g.
.11), or names with dashes ('-') or spaces can only be used when they are quoted with backticks: `1a` and `.11`. The
names will be printed with backticks:


list( '.11' ="a")
#$`.11`
#[1] "a"


• All other combinations of alphanumeric characters, dots and underscores can be used freely, where
reference with or without backticks points to the same object.


• Names that begin with . are considered system names and are not always visible using the ls()-function.


There is no restriction on the number of characters in a variable name. 
Some examples of valid object names are: foobar, foo.bar, foo_bar, .foobar
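

The naming rules above can be tried out directly. The following sketch uses illustrative names (my_list, list_names, same_object) that do not appear in the original text:

```r
`1a` <- 10                      # a name starting with a digit needs backticks
my_list <- list(`.11` = "a")    # so does a name that looks like a number
list_names <- names(my_list)    # the stored name is the plain string ".11"
foo.bar <- 1                    # ordinary names need no quoting
same_object <- identical(foo.bar, `foo.bar`)   # backticks refer to the same object
```

Backticks are only a way of writing the name; they are not part of the name itself.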


In R, variables are assigned values using the infix-assignment operator <-. The operator = can also be used for 
assigning values to variables, however its proper use is for associating values with parameter names in function 
calls. Note that omitting spaces around operators may create confusion for users. The expression a<-1 is parsed as 
assignment (a <- 1) rather than as a logical comparison (a < -1). 


> foo <- 42 
> fooEquals = 43 


So foo is assigned the value of 42. Typing foo within the console will output 42, while typing fooEquals will output 
43. 


> foo 
[1] 42 
> fooEquals 
[1] 43 
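

The spacing pitfall mentioned earlier (a<-1 versus a < -1) can be reproduced with a short sketch. The variable names after_assignment and is_smaller are illustrative:

```r
a <- 5
a<-1                   # parsed as an assignment: a is now 1
after_assignment <- a
is_smaller <- a < -1   # with spaces this is a comparison: is a less than -1?
```

Writing spaces around <- and < makes the intended meaning visible to readers as well as to the parser.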


The following command assigns a value to the variable named x and prints the value simultaneously:


> (x <- 5)
[1] 5
# actually two function calls: first one to `<-`; second one to the `()`-function


> is.function(`(`)
[1] TRUE # Often used in R help page examples for its side-effect of printing.


It is also possible to make assignments to variables using ->.


> 5 -> x
> x
[1] 5


Types of data structures 


There are no scalar data types in R. Vectors of length-one act like scalars. 


Vectors: Atomic vectors must be a sequence of same-class objects: a sequence of numbers, or a sequence of
logicals, or a sequence of characters. v <- c(2, 3, 7, 10) and v2 <- c("a", "b", "c") are both vectors.

Matrices: A matrix of numbers, logicals or characters. a <- matrix(data = c(1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12), nrow = 4, ncol = 3, byrow = F). Like vectors, a matrix must be made of same-class
elements. To extract elements from a matrix, rows and columns must be specified: a[1,2] returns [1] 5, that
is, the element in the first row, second column.

Lists: concatenation of different elements. mylist <- list(course = 'stat', date = '04/07/2009',
num_isc = 7, num_cons = 6, num_mat = as.character(c(45020, 45679, 46789, 43126, 42345, 47568,
45674)), results = c(30, 19, 29, NA, 25, 26, 27)). Extracting elements from a list can be done by
name (if the list is named) or by index. In the given example mylist$results and mylist[[6]] obtain the
same element. Warning: if you try mylist[6], R won't give you an error, but it extracts the result as a list.
While mylist[[6]][2] is permitted (it gives you 19), mylist[6][2] does not give you 19: it returns a list
containing NULL, because mylist[6] is a list with only one element.

data.frame: object with columns that are vectors of equal length, but (possibly) different types. They are not
matrices. exam <- data.frame(matr = as.character(c(45020, 45679, 46789, 43126, 42345, 47568,
45674)), res_S = c(30, 19, 29, NA, 25, 26, 27), res_O = c(3, 3, 1, NA, 3, 2, NA), res_TOT =
c(30, 22, 30, NA, 28, 28, 27)). Columns can be read by name exam$matr, exam[, 'matr'] or by index exam[1],
exam[, 1]. Rows can also be read by name exam['rowname', ] or index exam[1, ]. Data frames are actually
just lists with a particular structure (rownames-attribute and equal length components).
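

The difference between single and double brackets on a list is easy to check. This is a minimal sketch with a hypothetical list called course, smaller than the mylist example above:

```r
course <- list(name = "stat", results = c(30, 19, 29))
as_list <- course["results"]        # single brackets: a list of length one
as_vector <- course[["results"]]    # double brackets: the numeric vector itself
class(as_list)                      # "list"
class(as_vector)                    # "numeric"
second_result <- course[["results"]][2]   # second element of the vector
```

Because a data frame is a list, the same distinction applies to its columns.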


Common operations and some cautionary advice 


Default operations are done element by element. See ?Syntax for the rules of operator precedence. Most
operators (and many other functions in base R) have recycling rules that allow arguments of unequal length. Given
these objects:


Example objects


> a <- 1
> b <- 2
> c <- c(2,3,4)
> d <- c(10,10,10)
> e <- c(1,2,3,4)
> f <- 1:6
> W <- cbind(1:4,5:8,9:12)
> Z <- rbind(rep(0,3),1:3,rep(10,3),c(4,7,1))


Some vector operations


> a+b # scalar + scalar
[1] 3
> c+d # vector + vector
[1] 12 13 14
> a*b # scalar * scalar
[1] 2
> c*d # vector * vector (componentwise!)
[1] 20 30 40
> c+a # vector + scalar
[1] 3 4 5
> c^2 #
[1] 4 9 16
> exp(c)
[1] 7.389056 20.085537 54.598158
Some vector operation Warnings!


> c+e # warning but.. no errors, since recycling is assumed to be desired.
[1] 3 5 7 6
Warning message:
In c + e : longer object length is not a multiple of shorter object length


R sums what it can and then reuses the shorter vector to fill in the blanks... The warning was given only because the
lengths of the two vectors are not exact multiples of each other. By contrast, c+f gives no warning whatsoever.
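

To make the recycling rule concrete, the sketch below repeats both situations with illustrative names (short, long, no_warning, outcome) and uses tryCatch() to capture the warning as a value instead of printing it:

```r
short <- c(2, 3, 4)
long <- 1:6
no_warning <- short + long   # 6 is a multiple of 3: silent recycling

# Turn the recycling warning into a value we can inspect.
outcome <- tryCatch(
  short + c(1, 2, 3, 4),     # 4 is not a multiple of 3
  warning = function(w) conditionMessage(w)
)
```

Here outcome holds the warning text, while no_warning holds the ordinary numeric result.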


Some Matrix operations Warning!
> Z+W # matrix + matrix #(componentwise)
> Z*W # matrix * matrix #(Standard product is always componentwise)


To use a matrix multiply: V %*% W


> W + a # matrix + scalar is still componentwise
     [,1] [,2] [,3]
[1,]    2    6   10
[2,]    3    7   11
[3,]    4    8   12
[4,]    5    9   13
> W + c # matrix + vector... : no warnings and R does the operation in a column-wise manner
     [,1] [,2] [,3]
[1,]    3    8   13
[2,]    5   10   12
[3,]    7    9   14
[4,]    6   11   16
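

The difference between the componentwise product and a true matrix product can be sketched as follows. Since %*% requires conformable dimensions, this example multiplies the transpose of W by W; that choice, and the names elementwise and gram, are assumptions made for illustration:

```r
W <- cbind(1:4, 5:8, 9:12)   # 4 x 3 matrix, as in the example objects
elementwise <- W * W          # componentwise product, still 4 x 3
gram <- t(W) %*% W            # matrix product of a 3 x 4 and a 4 x 3 matrix
dim(elementwise)
dim(gram)
```

The componentwise product keeps the shape of its inputs, whereas the matrix product has as many rows as its left factor and as many columns as its right factor.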


"Private" variables 


A leading dot in a name of a variable or function in R is commonly used to denote that the variable or function is 
meant to be hidden. 


So, declaring the following variables


> foo <- 'foo'
> .foo <- 'bar'


And then using the ls function to list objects will only show the first object.


> ls()
[1] "foo"


However, passing all.names = TRUE to the function will show the 'private' variable


> ls(all.names = TRUE)
[1] ".foo" "foo"
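

The same behavior can be reproduced without touching the global workspace by listing a separate environment. This is a sketch; the environment env and the names visible and everything are illustrative:

```r
env <- new.env()
assign("foo", "foo", envir = env)
assign(".foo", "bar", envir = env)
visible <- ls(env)                        # hides names that start with a dot
everything <- ls(env, all.names = TRUE)   # includes them
```

Note that the leading dot only hides the object from the default listing; it can still be read and modified like any other object.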
Chapter 3: Arithmetic Operators


Section 3.1: Range and addition 
Let's take an example of adding a value to a range (as it could be done in a loop for example):

3+1:5

Gives:

[1] 4 5 6 7 8

This is because the range operator : has higher precedence than the addition operator +.
What happens during evaluation is as follows:

• 3+1:5
• 3+c(1, 2, 3, 4, 5) expansion of the range operator to make a vector of integers.
• c(4, 5, 6, 7, 8) Addition of 3 to each member of the vector.

To avoid this behavior you have to tell the R interpreter how you want it to order the operations with ( ) like this:

(3+1):5

Now R will compute what is inside the parentheses before expanding the range and gives:

[1] 4 5
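

The same precedence rule applies to subtraction, which is a common source of off-by-one mistakes when building index ranges. A minimal sketch, with illustrative names n, wrong and right:

```r
n <- 5
wrong <- 1:n - 1     # evaluated as (1:n) - 1, giving 0 1 2 3 4
right <- 1:(n - 1)   # parentheses force the subtraction first, giving 1 2 3 4
```

When in doubt, adding parentheses around the intended end point of the range costs nothing and removes the ambiguity.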


Section 3.2: Addition and subtraction 
The basic math operations are performed mainly on numbers or on vectors (lists of numbers). 
1. Using single numbers 


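We can simply enter the numbers concatenated with + for adding and - for subtracting:


> 3 + 4.5
# [1] 7.5
> 3 + 4.5 + 2
# [1] 9.5
> 3 + 4.5 + 2 - 3.8
# [1] 5.7
> 3 + NA
#[1] NA
> NA + NA
#[1] NA
> NA - NaN
#[1] NA
> NaN - NA
#[1] NaN
> NaN + NA
#[1] NaN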

We can assign the numbers to variables (constants in this case) and do the same operations:


> a <- 3; B <- 4.5; cc <- 2; Dd <- 3.8; na <- NA; nan <- NaN
> a + B
# [1] 7.5
> a + B + cc
# [1] 9.5
> a + B + cc - Dd
# [1] 5.7
> a + na
#[1] NA
> B-nan
#[1] NaN
> a+na-na
#[1] NA

2. Using vectors 


In this case we create vectors of numbers and do the operations using those vectors, or combinations with single 
numbers. In this case the operation is done considering each element of the vector: 


> A <- c(3, 4.5, 2, -3.8)
> A
# [1] 3.0 4.5 2.0 -3.8
> A + 2 # Adding a number
# [1] 5.0 6.5 4.0 -1.8
> 8 - A # number less vector
# [1] 5.0 3.5 6.0 11.8
> n <- length(A) #number of elements of vector A
> n
# [1] 4
> A[-n] + A[n] # Add the last element to the same vector without the last element
# [1] -0.8 0.7 -1.8
> A[1:2] + 3 # vector with the first two elements plus a number
# [1] 6.0 7.5
> A[1:2] - A[3:4] # vector with the first two elements less the vector with elements 3 and 4
# [1] 1.0 8.3


We can also use the function sum to add all elements of a vector: 


> sum(A)
# [1] 5.7
> sum(-A)
# [1] -5.7
> sum(A[-n]) + A[n]
# [1] 5.7


We must take care with recycling, which is one of the characteristics of R, a behavior that happens when doing math
operations where the length of vectors is different. Shorter vectors in the expression are recycled as often as need be
(perhaps fractionally) until they match the length of the longest vector. In particular a constant is simply repeated. In
the following case a warning is shown.


> B <- c(3, 5, -3, 2.7, 1.8)
> B
# [1] 3.0 5.0 -3.0 2.7 1.8
> A
# [1] 3.0 4.5 2.0 -3.8
> A + B # the first element of A is repeated
# [1] 6.0 9.5 -1.0 -1.1 4.8
Warning message:
In A + B : longer object length is not a multiple of shorter object length
> B - A # the first element of A is repeated
# [1] 0.0 0.5 -5.0 6.5 -1.2
Warning message:
In B - A : longer object length is not a multiple of shorter object length


In this case the correct procedure will be to consider only the elements of the shorter vector:


> B[1:n] + A
# [1] 6.0 9.5 -1.0 -1.1
> B[1:n] - A
# [1] 0.0 0.5 -5.0 6.5


When using the sum function, again all the elements inside the function are added.


> sum(A, B)
# [1] 15.2
> sum(A, -B)
# [1] -3.8
> sum(A)+sum(B)
# [1] 15.2
> sum(A)-sum(B)
# [1] -3.8
Chapter 4: Matrices


Matrices store data 


Section 4.1: Creating matrices 


Under the hood, a matrix is a special kind of vector with two dimensions. Like a vector, a matrix can only have one 
data class. You can create matrices using the matrix function as shown below. 


matrix(data = 1:6, nrow = 2, ncol = 3)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6


As you can see this gives us a matrix of all numbers from 1 to 6 with two rows and three columns. The data 
parameter takes a vector of values, nrow specifies the number of rows in the matrix, and ncol specifies the number 
of columns. By convention the matrix is filled by column. The default behavior can be changed with the byrow 
parameter as shown below: 


matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6


Matrices do not have to be numeric — any vector can be transformed into a matrix. For example: 


matrix(data = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE), nrow = 3, ncol = 2)
##      [,1]  [,2]
## [1,] TRUE FALSE
## [2,] TRUE FALSE
## [3,] TRUE FALSE
matrix(data = c("a", "b", "c", "d", "e", "f"), nrow = 3, ncol = 2)
##      [,1] [,2]
## [1,] "a"  "d"
## [2,] "b"  "e"
## [3,] "c"  "f"


Like vectors matrices can be stored as variables and then called later. The rows and columns of a matrix can have 
names. You can look at these using the functions rownames and colnames. As shown below, the rows and columns 
don't initially have names, which is denoted by NULL. However, you can assign values to them. 


mat1 <- matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)
rownames(mat1)
## NULL
colnames(mat1)
## NULL
rownames(mat1) <- c("Row 1", "Row 2")
colnames(mat1) <- c("Col 1", "Col 2", "Col 3")
mat1
##       Col 1 Col 2 Col 3
## Row 1     1     2     3
## Row 2     4     5     6


It is important to note that, similarly to vectors, matrices can only have one data type. If you try to specify a matrix
with multiple data types, the data will be coerced to the higher-order data class.
The class, is, and as functions can be used to check and coerce data structures in the same way they were used
on the vectors in class 1.


class(mat1)
## [1] "matrix"   (R 4.0.0 and later report "matrix" "array")
is.matrix(mat1)
## [1] TRUE
as.vector(mat1)
## [1] 1 4 2 5 3 6
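

The coercion to a common type described above can be observed with typeof(). This is a minimal sketch; the matrices mixed and num_lgl are made up for illustration:

```r
mixed <- matrix(c(1, "a", TRUE, 2), nrow = 2)         # numbers, text and a logical
typeof(mixed)                                         # everything became character
num_lgl <- matrix(c(1.5, TRUE, FALSE, 2), nrow = 2)   # numbers and logicals only
typeof(num_lgl)                                       # logicals became numbers
```

Character is the most general of these types, so a single string is enough to turn the whole matrix into text.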
Chapter 5: Formula


Section 5.1: The basics of formula 


Statistical functions in R make heavy use of the so-called Wilkinson-Rogers formula notation¹.


When running model functions like lm for linear regression, they need a formula. This formula specifies which
regression coefficients shall be estimated.


my_formula1 <- formula(mpg ~ wt)
class(my_formula1)
# gives "formula"
mod1 <- lm(my_formula1, data = mtcars)
coef(mod1)
# gives (Intercept)        wt
#         37.285126 -5.344472


On the left side of the ~ (LHS) the dependent variable is specified, while the right hand side (RHS) contains the 
independent variables. Technically the formula call above is redundant because the tilde-operator is an infix 
function that returns an object with formula class: 


form <- mpg ~ wt 
class(form) 
#[1] "formula" 


The advantage of the formula function over ~ is that it also allows an environment for evaluation to be specified: 
form_mt <- formula(mpg ~ wt, env = mtcars) 


In this case, the output shows that a regression coefficient for wt is estimated, as well as (by default) an intercept
parameter. The intercept can be excluded / forced to be 0 by including 0 or -1 in the formula:


coef(lm(mpg ~ 0 + wt, data = mtcars))
coef(lm(mpg ~ wt -1, data = mtcars))


Interactions between variables a and b can be added by including a:b in the formula:


coef(lm(mpg ~ wt:vs, data = mtcars))


As it is (from a statistical point of view) generally advisable not to have interactions in the model without the main
effects, the naive approach would be to expand the formula to a + b + a:b. This works but can be simplified by
writing a*b, where the * operator indicates factor crossing (when between two factor columns) or multiplication
when one or both of the columns are 'numeric':


coef(lm(mpg ~ wt*vs, data = mtcars))


Using the * notation expands a term to include all lower order effects, such that:


coef(lm(mpg ~ wt*vs*hp, data = mtcars))


will give, in addition to the intercept, 7 regression coefficients: one for the three-way interaction, three for the
two-way interactions and three for the main effects.
If one wants, for example, to exclude the three-way interaction, but retain all two-way interactions, there are two
shorthands. First, using - we can subtract any particular term:


coef(lm(mpg ~ wt*vs*hp - wt:vs:hp, data = mtcars))


Or, we can use the ^ notation to specify which level of interaction we require:


coef(lm(mpg ~ (wt + vs + hp) ^ 2, data = mtcars))


Those two formula specifications should create the same model matrix.
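

The claim that both specifications describe the same model can be checked by comparing the fitted coefficients. The sketch below writes the second formula with the ^ operator and uses illustrative names (fit_minus, fit_power, and so on); coefficients are sorted by name so that the comparison does not depend on term order:

```r
fit_minus <- lm(mpg ~ wt * vs * hp - wt:vs:hp, data = mtcars)
fit_power <- lm(mpg ~ (wt + vs + hp)^2, data = mtcars)

names_minus <- sort(names(coef(fit_minus)))
names_power <- sort(names(coef(fit_power)))

same_terms <- identical(names_minus, names_power)
same_values <- isTRUE(all.equal(coef(fit_minus)[names_minus],
                                coef(fit_power)[names_power]))
```

Both models contain the intercept, the three main effects and the three two-way interactions, so seven coefficients in total.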


Finally, . is shorthand to use all available variables as main effects. In this case, the data argument is used to obtain 
the available variables (which are not on the LHS). Therefore: 


coef(lm(mpg ~ ., data = mtcars) ) 


gives coefficients for the intercept and 10 independent variables. This notation is frequently used in machine
learning packages, where one would like to use all variables for prediction or classification. Note that the meaning
of . depends on context (see e.g. ?update.formula for a different meaning).
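

As a short illustration of that other meaning: in update(), a dot stands for "the corresponding part of the old formula". The names base_formula and extended below are illustrative:

```r
base_formula <- mpg ~ wt
extended <- update(base_formula, . ~ . + hp)   # keep both sides, add hp on the right
extended
```

Here the first dot keeps the original response (mpg) and the second dot keeps the original right-hand side (wt).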


1. G.N. Wilkinson and C. E. Rogers. Journal of the Royal Statistical Society. Series C (Applied Statistics) Vol. 22, No. 3 
(1973), pp. 392-399 
Chapter 6: Reading and writing strings
Section 6.1: Printing and displaying strings 


R has several built-in functions that can be used to print or display information, but print and cat are the most
basic. As R is an interpreted language, you can try these out directly in the R console:


print("Hello World" ) 
#[1] "Hello World" 
cat("Hello World\n") 
#Hello World 


Note the difference in both input and output for the two functions. (Note: there are no quote-characters in the 
value of x created with x <- "Hello World". They are added by print at the output stage.) 


cat takes one or more character vectors as arguments and prints them to the console. If the character vector has a 
length greater than 1, arguments are separated by a space (by default): 


cat(c("hello", "world", "\n")) 
#hello world 


Without the new-line character (\n) the output would be: 


cat("Hello World") 
#Hello World> 


The prompt for the next command appears immediately after the output. (Some consoles such as RStudio's may 
automatically append a newline to strings that do not end with a newline.) 


print is an example of a "generic" function, which means the class of the first argument passed is detected and a
class-specific method is used to output. For a character vector like "Hello World", the result is similar to the output
of cat. However, the character string is quoted and a number [1] is output to indicate the first element of a
character vector (in this case, the first and only element):


print("Hello World") 
#[1] "Hello World" 


This default print method is also what we see when we simply ask R to print a variable. Note how the output of 
typing s is the same as calling print(s) or print("Hello World"): 


s <- "Hello World" 
s 
#[1] "Hello World" 


Or even without assigning it to anything: 


"Hello World" 
#[1] "Hello World" 


If we add another character string as a second element of the vector (using the c() function to concatenate the
elements together), then the behavior of print() looks quite a bit different from that of cat:


print(c("Hello World", "Here I am."))
#[1] "Hello World" "Here I am."


Observe that the c() function does not do string-concatenation. (One needs to use paste for that purpose.) R 
shows that the character vector has two elements by quoting them separately. If we have a vector long enough to 
span multiple lines, R will print the index of the element starting each line, just as it prints [1] at the start of the first 
line. 


c("Hello World", "Here I am!", "This next string is really long.") 
#[1] "Hello World" "Here I am!" 
#[3] "This next string is really long." 


The particular behavior of print depends on the class of the object passed to the function. 


If we call print on an object with a different class, such as "numeric" or "logical", the quotes are omitted from the
output to indicate we are dealing with an object that is not character class:


print(1)
#[1] 1
print(TRUE)
#[1] TRUE


Factor objects get printed in the same fashion as character variables, which often creates ambiguity when console
output is used to display objects in Stack Overflow question bodies. It is rare to use cat or print except in an
interactive context. Explicitly calling print() is particularly rare (unless you want to suppress the appearance of the
quotes or view an object that is returned as invisible by a function), as entering foo at the console is a shortcut for
print(foo). The interactive console of R is known as a REPL, a "read-eval-print-loop". The cat function is best saved
for special purposes (like writing output to an open file connection). Sometimes it is used inside functions (where
calls to print() are suppressed); however, using cat() inside a function to generate output to the console is
bad practice. The preferred method is to use message() or warning() for intermediate messages; they behave
similarly to cat but can be optionally suppressed by the end user. The final result should simply be returned so that
the user can assign it to a variable and store it if necessary.


message("hello world") 
#hello world 
suppressMessages(message("hello world")) 
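

The recommendation above can be sketched with a small function that reports progress through message() and hands back its result as a return value. The helper mean_with_progress is hypothetical and exists only for this illustration:

```r
# Hypothetical helper: reports progress with message() and returns its result.
mean_with_progress <- function(x) {
  message("Computing the mean of ", length(x), " values")
  mean(x)   # the result is returned, not printed
}

avg <- mean_with_progress(c(1, 2, 3))                            # shows the message
quiet_avg <- suppressMessages(mean_with_progress(c(1, 2, 3)))    # message hidden
```

Because the message goes through R's condition system rather than straight to the console, the caller decides whether to see it, while the computed value is available in both cases.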


Section 6.2: Capture output of operating system command 


Functions which return a character vector 


Base R has two functions for invoking a system command. Both require an additional parameter to capture the 
output of the system command. 


system("top -a -b -n 1", intern = TRUE) 
system2("top", "-a -b -n 1", stdout = TRUE) 


Both return a character vector. 


[1] "top - 08:52:03 up 70 days, 15:09, 0 users, load average: 0.00, 0.00, 0.00"
[2] "Tasks: 125 total, 1 running, 124 sleeping, 0 stopped, 0 zombie"
[3] "Cpu(s): 0.9%us, 0.3%sy, 0.0%ni, 98.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st"
[4] "Mem: 12194312k total, 3613292k used, 8581020k free, 216940k buffers"
[5] "Swap: 12582908k total, 2334156k used, 10248752k free, 1682340k cached"
[6] ""
[7] "  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND "
[8] "11300 root 20 0 1278m 375m 3696 S 0.0 3.2 124:40.92 trala "
[9] " 6093 user1 20 0 1817m 269m 1888 S 0.0 2.3 12:17.96 R "
[10] " 4949 user2 20 0 1917m 214m 1888 S 0.0 1.8 11:16.73 R "


For illustration, the UNIX command top -a -b -n 1 is used. This is OS specific and may need to be 
amended to run the examples on your computer. 


Package devtools has a function to run a system command and capture the output without an additional 
parameter. It also returns a character vector. 


devtools::system_output("top", "-a -b -n 1")


Functions which return a data frame 


The fread function in package data.table can execute a shell command and read the output like
read.table. It returns a data.table or a data.frame.


fread("top -a -b -n 1", check.names = TRUE) 


     PID  USER PR NI  VIRT  RES  SHR S X.CPU X.MEM     TIME. COMMAND
1: 11300  root 20  0 1278m 375m 3696 S     0   3.2 124:40.92   trala
2:  6093 user1 20  0 1817m 269m 1888 S     0   2.3  12:18.56       R
3:  4949 user2 20  0 1917m 214m 1888 S     0   1.8  11:17.33       R
4:  7922 user3 20  0 3094m 131m 1892 S     0   1.1  21:04.95       R


Note that fread has automatically skipped the top 6 header lines. 


Here the parameter check.names = TRUE was added to convert %CPU, %MEM, and TIME+ to syntactically 
valid column names. 


Section 6.3: Reading from or writing to a file connection 


We do not always have the liberty to read from or write to a local file system path. For example, R code used in 
streaming map-reduce must read from and write to file connections. There are other scenarios that go beyond the 
local system, and with the advent of cloud and big data these are becoming increasingly common. One way to 
handle this is the following logical sequence. 


Establish a file connection to read with file() command ("r" is for read mode): 


conn <- file("/path/example.data", "r") #when file is in local system 
conn1 <- file("stdin", "r") #when just standard input/output for files are available 


Since this only establishes the file connection, the data can then be read from it as follows: 


line <- readLines(conn, n=1, warn=FALSE) 


Here we read the data from the file connection conn line by line, since n=1. You can change the value of n (say 10 
or 20) to read blocks of lines in one go, which is faster. To read the complete file in one go, set n=-1. 


After data processing or, say, model execution, the results can be written back to a file connection using commands 
such as writeLines() or cat(), which are capable of writing to a file connection. All of these commands need a file 
connection established for writing, which can be created with file(): 


conn2 <- file("/path/result.data", "w") #when file is in local system 
conn3 <- file("stdout", "w") #when just standard input/output for files are available 


Then write the data as follows: 


writeLines("text",conn2, sep = "\n") 
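
The whole sequence (open for writing, write, close, open for reading, read in blocks, close) can be sketched end to end. The path comes from tempfile() and is only a stand-in for a real location; the line contents are made up for the example.

```r
path <- tempfile(fileext = ".data")   # hypothetical scratch file

# Write three lines through a connection opened in "w" mode
out_conn <- file(path, "w")
writeLines(c("first", "second", "third"), out_conn)
close(out_conn)

# Read them back through a connection opened in "r" mode
in_conn <- file(path, "r")
first <- readLines(in_conn, n = 1)    # one line
rest  <- readLines(in_conn, n = -1)   # everything that is left
close(in_conn)

unlink(path)                          # remove the scratch file
```

An open connection keeps its position between calls, so the second readLines continues after the first line. Closing connections explicitly avoids warnings about unused connections.
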
Chapter 7: String manipulation with stringi package 

Section 7.1: Count pattern inside string 


With fixed pattern 


stri_count_fixed("babab", "b") 
# [1] 3 
stri_count_fixed("babab", "ba") 
# [1] 2 
stri_count_fixed("babab", "bab") 
# [1] 1 


Natively: 


length(gregexpr("b", "babab")[[1]]) 
# [1] 3 
length(gregexpr("ba", "babab")[[1]]) 
# [1] 2 
length(gregexpr("bab", "babab")[[1]]) 
# [1] 1 


The function is vectorized over both string and pattern: 

stri_count_fixed("babab", c("b","ba")) 
# [1] 3 2 


stri_count_fixed(c("babab", "bbb", "bca","abc"), c("b", "ba")) 
# [1] 3 0 1 0 


A base R solution: 

sapply(c("b", "ba"), function(x) length(gregexpr(x, "babab")[[1]])) 
#  b ba 
#  3  2 


With regex 
First example: find a followed by any character. 


stri_count_regex("a1 b2 a3 b4 aa", "a.") 
# [1] 3 


Second example: find a followed by any digit. 


stri_count_regex("a1 b2 a3 b4 aa", "a\\d") 
# [1] 2 
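
One caveat with the base R counts shown earlier in this section: gregexpr returns -1 when nothing matches, so taking the length of its result reports 1 instead of 0. A minimal sketch of a safer counter follows; count_fixed is a hypothetical helper name, not part of base R or stringi.

```r
# Hypothetical helper: counts non-overlapping fixed matches, 0 when there are none
count_fixed <- function(x, pattern) {
  hits <- gregexpr(pattern, x, fixed = TRUE)[[1]]
  sum(hits > 0)   # the "no match" marker -1 is not counted
}

n_b <- count_fixed("babab", "b")
n_z <- count_fixed("babab", "z")

# For comparison: the length-based count for a pattern that never occurs
naive_z <- length(gregexpr("z", "babab")[[1]])
```

Here n_b is 3 and n_z is 0, while naive_z is 1 because the single -1 marker is counted as a hit.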


Section 7.2: Duplicating strings 


stri_dup("abc",3) 
# [1] "abcabcabc" 


A base R solution that does the same would look like this: 

paste0(rep("abc",3),collapse = "") 
# [1] "abcabcabc" 
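
Base R (version 3.3.0 and later) also ships strrep, which covers the same task without paste0 and is vectorized over the repeat count. A short sketch:

```r
one  <- strrep("abc", 3)    # repeat a single string three times
many <- strrep("ab", 0:2)   # one result per repeat count
one
many
```

The second call returns one string per count, starting with the empty string for a count of 0.
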


Section 7.3: Paste vectors 


stri_paste(LETTERS, "-", 1:13) 
#  [1] "A-1"  "B-2"  "C-3"  "D-4"  "E-5"  "F-6"  "G-7"  "H-8"  "I-9"  "J-10" "K-11" "L-12" "M-13" 
# [14] "N-1"  "O-2"  "P-3"  "Q-4"  "R-5"  "S-6"  "T-7"  "U-8"  "V-9"  "W-10" "X-11" "Y-12" "Z-13" 


Natively, we could do this in R via: 

paste(LETTERS, 1:13, sep = "-") 
#  [1] "A-1"  "B-2"  "C-3"  "D-4"  "E-5"  "F-6"  "G-7"  "H-8"  "I-9"  "J-10" "K-11" "L-12" "M-13" 
# [14] "N-1"  "O-2"  "P-3"  "Q-4"  "R-5"  "S-6"  "T-7"  "U-8"  "V-9"  "W-10" "X-11" "Y-12" "Z-13" 


Section 7.4: Splitting text by some fixed pattern 


Split vector of texts using one pattern: 


stri_split_fixed(c("To be or not to be.", "This is very short sentence."), " ") 
# [[1]] 
# [1] "To"  "be"  "or"  "not" "to"  "be." 
# 
# [[2]] 
# [1] "This"      "is"        "very"      "short"     "sentence." 


Split one text using many patterns: 


stri_split_fixed("Apples, oranges and pineaplles.", c(" ", ",", "s")) 
# [[1]] 
# [1] "Apples,"     "oranges"     "and"         "pineaplles." 
# 
# [[2]] 
# [1] "Apples"                   " oranges and pineaplles." 
# 
# [[3]] 
# [1] "Apple"          ", orange"       " and pineaplle" "." 
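
For comparison, a base R sketch of the same two splits using strsplit with fixed = TRUE. Unlike stri_split_fixed, strsplit is not vectorized over the pattern, so lapply is used here for the "many patterns" case. The variable names are illustrative only.

```r
texts <- c("To be or not to be.", "This is very short sentence.")

# One pattern, several texts: strsplit is vectorized over its first argument
words <- strsplit(texts, " ", fixed = TRUE)
lengths(words)

# One text, several patterns: loop over the patterns explicitly
one_text <- "Apples, oranges and pineaplles."
pieces <- lapply(c(" ", ",", "s"),
                 function(p) strsplit(one_text, p, fixed = TRUE)[[1]])
pieces[[3]]
```

Note that strsplit drops a trailing empty string where stri_split_fixed keeps it, so the two can differ when the text ends with the pattern.
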
Chapter 8: Classes 


The class of a data object determines which functions will process its contents. The class attribute is a character 
vector, and objects can have zero, one or more classes. If there is no class attribute, there will still be an implicit 
class determined by an object's mode. The class can be inspected with the function class and it can be set or 
modified by the class<- function. The S3 class system was established early in S's history. The more complex S4 
class system was established later. 


Section 8.1: Inspect classes 


Every object in R is assigned a class. You can use class() to find the object's class and str() to see its structure, 
including the classes it contains. For example: 


class(iris) 
[1] "data.frame" 


str(iris) 
'data.frame':   150 obs. of  5 variables: 
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... 
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... 
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... 
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... 
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... 

class(iris$Species) 
[1] "factor" 


We see that iris has the class data.frame and using str() allows us to examine the data inside. The variable 
Species in the iris data frame is of class factor, in contrast to the other variables which are of class numeric. The 
str() function also provides the length of the variables and shows the first couple of observations, while the 
class() function only provides the object's class. 
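
When code needs to test for a class rather than print it, inherits() is often more convenient than comparing the output of class(). A small sketch on the same iris data:

```r
# TRUE/FALSE test for a class, also works when an object has several classes
is_df     <- inherits(iris, "data.frame")
is_factor <- inherits(iris$Species, "factor")

# Class of every column at once
column_classes <- sapply(iris, class)
column_classes
```

The last call returns a named character vector with one entry per column.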


Section 8.2: Vectors and lists 


Data in R are stored in vectors. A typical vector is a sequence of values all having the same storage mode (e.g., 
character vectors, numeric vectors). See ?atomic for details on the atomic implicit classes and their corresponding 
storage modes: "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw". 


Many classes are simply an atomic vector with a class attribute on top: 


x <- 1826 
class(x) <- "Date" 
x 
# [1] "1975-01-01" 
x <- as.Date("1970-01-01") 
class(x) 
#[1] "Date" 
is(x, "Date") 
#[1] TRUE 
is(x, "integer") 
#[1] FALSE 
is(x, "numeric") 
#[1] FALSE 
mode(x) 
#[1] "numeric" 
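
Going the other way, unclass() strips the class attribute and exposes the underlying atomic vector. A brief sketch with a Date:

```r
d <- as.Date("1975-01-01")
days <- unclass(d)   # the stored number of days since 1970-01-01
days
attributes(d)        # the only attribute is the class
```

The stored value is 1826, the same number that was turned into a Date by assigning the class attribute.
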
Lists are a special type of vector where each element can be anything, even another list, hence the R term for lists: 
"recursive vectors": 


mylist <- list( A = c(5,6,7,8), B = letters[1:10], CC = list( 5, "Z") ) 


Lists have two very important uses: 


• Since functions can only return a single value, it is common to return complicated results in a list: 


f <- function(x) list(xplus = x + 10, xsq = x^2) 

f(7) 
# $xplus 
# [1] 17 
# 
# $xsq 
# [1] 49 


• Lists are also the underlying fundamental class for data frames. Under the hood, a data frame is a list of 
vectors all having the same length: 


L <- list(x = 1:2, y = c("A","B")) 
DF <- data.frame(L) 
DF 
#   x y 
# 1 1 A 
# 2 2 B 
is.list(DF) 
# [1] TRUE 


The other class of recursive vectors is R expressions, which are "language" objects. 


Section 8.3: Vectors 


The simplest data structure available in R is a vector. You can make vectors of numeric values, logical values, 
and character strings using the c() function. For example: 


c(1, 2, 3) 
## [1] 1 2 3 
c(TRUE, TRUE, FALSE) 
## [1]  TRUE  TRUE FALSE 
c("a", "b", "c") 
## [1] "a" "b" "c" 


You can also join two vectors using the c() function. 

x <- c(1, 2, 5) 
y <- c(3, 4, 6) 
z <- c(x, y) 
z 
## [1] 1 2 5 3 4 6 


A more elaborate treatment of how to create vectors can be found in the "Creating vectors" topic. 


Chapter 9: Lists 


Section 9.1: Introduction to lists 


Lists allow users to store multiple elements (like vectors and matrices) under a single object. You can use the list 
function to create a list: 


l1 <- list(c(1, 2, 3), c("a", "b", "c")) 
l1 
## [[1]] 
## [1] 1 2 3 
## 
## [[2]] 
## [1] "a" "b" "c" 


Notice the vectors that make up the above list are different classes. Lists allow users to group elements of different 
classes. Each element in a list can also have a name. List names are accessed by the names function, and are 
assigned in the same manner row and column names are assigned in a matrix. 


names(l1) 
## NULL 
names(l1) <- c("vector1", "vector2") 
l1 
## $vector1 
## [1] 1 2 3 
## 
## $vector2 
## [1] "a" "b" "c" 


It is often easier and safer to declare the list names when creating the list object. 


l2 <- list(vec = c(1, 3, 5, 7, 9), 
           mat = matrix(data = c(1, 2, 3), nrow = 3)) 
l2 
## $vec 
## [1] 1 3 5 7 9 
## 
## $mat 
##      [,1] 
## [1,]    1 
## [2,]    2 
## [3,]    3 
names(l2) 
## [1] "vec" "mat" 


The list above has two elements, named "vec" and "mat", a vector and a matrix, respectively. 


Section 9.2: Quick Introduction to Lists 


In general, most of the objects you interact with as a user tend to be vectors, e.g. numeric vectors or logical 
vectors. These objects can hold only a single type of value (a numeric vector can only have numbers inside it). 


A list can store values of any type, which makes it the generic object for holding whatever kinds of values we 
need. 


Example of initializing a list 


exampleList1 <- list('a', 'b') 
exampleList2 <- list(1, 2) 
exampleList3 <- list('a', 1, 2) 


In order to understand the data that was defined in the list, we can use the str function. 


str(exampleList1) 
str(exampleList2) 
str(exampleList3) 


Subsetting of lists distinguishes between extracting a slice of the list, i.e. obtaining a list containing a subset of the 
elements in the original list, and extracting a single element. Using the [ operator commonly used for vectors 
produces a new list. 


# Returns List 


exampleList3[1] 
exampleList3[1:2] 


To obtain a single element use [[ instead. 


# Returns Character 
exampleList3[[1]] 


List entries may be named: 


exampleList4 <- list( 
    num = 1:3, 
    numeric = 0.5, 
    char = c('a', 'b') 
) 


The entries in named lists can be accessed by their name instead of their index. 


exampleList4[['char']] 


Alternatively the $ operator can be used to access named elements. 


exampleList4$num 


This has the advantage that it is faster to type and may be easier to read but it is important to be aware of a 
potential pitfall. The $ operator uses partial matching to identify matching list elements and may produce 
unexpected results. 


exampleList5 <- exampleList4[2:3] 


exampleList4$num 
# c(1, 2, 3) 


exampleList5$num 
# 0.5 


exampleList5[['num']] 
# NULL 
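
The difference between the two operators can be made explicit: [[ matches names exactly by default, and partial matching has to be requested through its exact argument. A minimal sketch, with a made-up list lst:

```r
lst <- list(numeric = 0.5, char = c("a", "b"))   # illustrative list

via_dollar  <- lst$num                        # partial match on "numeric"
via_exact   <- lst[["num"]]                   # exact matching: no element "num"
via_partial <- lst[["num", exact = FALSE]]    # partial matching, opted in
```

Here via_dollar and via_partial are both 0.5, while via_exact is NULL. Using [[ with its default therefore avoids the pitfall described above.
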
Lists can be particularly useful because they can store objects of different lengths and of various classes. 


## Numeric vector 
exampleVector1 <- c(12, 13, 14) 
## Character vector 
exampleVector2 <- c("a", "b", "c", "d", "e", "f") 
## Matrix 
exampleMatrix1 <- matrix(rnorm(4), ncol = 2, nrow = 2) 
## List 
exampleList3 <- list('a', 1, 2) 


exampleList6 <- list( 
    num = exampleVector1, 
    char = exampleVector2, 
    mat = exampleMatrix1, 
    list = exampleList3 
) 
exampleList6 
#$num 
#[1] 12 13 14 
# 
#$char 
#[1] "a" "b" "c" "d" "e" "f" 
# 
#$mat 
#          [,1]        [,2] 
#[1,] 0.5013050 -1.88801542 
#[2,] 0.4295266  0.09751379 
# 
#$list 
#$list[[1]] 
#[1] "a" 
# 
#$list[[2]] 
#[1] 1 
# 
#$list[[3]] 
#[1] 2 

Section 9.3: Serialization: using lists to pass information 


There exist cases in which it is necessary to put data of different types together. In Azure ML, for example, it is 
necessary to pass information from one R script module to another exclusively through dataframes. Suppose we 
have a dataframe and a number: 


> df 
       name height       team fun_index title age        desc 
1    Andrea    195      Lazio       ...   ... ...  eccellente 
2      Paja    165 Fiorentina        87   ... ...      deciso 
3      Roro    190      Lazio        65   ... ...      strano 
4    Gioele     70      Lazio       ...   ... ...   simpatico 
5     Cacio    170   Juventus        81   ... ...        duro 
6     Edola    ...      Lazio       ...   ... ...    svampito 
7    Salami    175      Inter       ...   ... ... doppiopasso 
8    Braugo    180      Inter       ...   ... ...         gjn 
9     Benna    158   Juventus        80   ... ...    esaurito 
10   Riggio    182      Lazio        92   ... ...    certezza 
11 Giordano    185       Roma       ...   ... ...       buono 

> number <- "42" 

We can access this information: 


> paste(df$name[4], "is a", df$team[4], "supporter.") 
[1] "Gioele is a Lazio supporter." 

> paste("The answer to THE question is", number) 
[1] "The answer to THE question is 42" 


In order to put different types of data in a dataframe we have to use a list object and serialization. In 
particular, we put the data in a generic list and then store the serialized list in a dataframe: 


l <- list(df, number) 


dataframe_container <- data.frame(out2 = as.integer(serialize(l, connection=NULL))) 


Once we have stored the information in the dataframe, we need to deserialize it in order to use it: 


#----- unserialize ----------------------------------------+ 
unser_obj <- unserialize(as.raw(dataframe_container$out2)) 
#----- taking back the elements----------------------------+ 
df_mod <- unser_obj[1][[1]] 
number_mod <- unser_obj[2][[1]] 


Then, we can verify that the data are transferred correctly: 


> paste(df_mod$name[4], "is a", df_mod$team[4], "supporter.") 
[1] "Gioele is a Lazio supporter." 

> paste("The answer to THE question is", number_mod) 
[1] "The answer to THE question is 42" 
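
The round trip can be checked in a self-contained way. The sketch below uses a small made-up data frame instead of the one above; the names payload, container and restored are illustrative only.

```r
# Made-up payload: a data frame plus a string, bundled in a list
payload <- list(
  data.frame(name = c("Ann", "Bob"), height = c(170, 180)),
  "42"
)

# serialize() yields a raw vector; as.integer() turns it into a plain column
container <- data.frame(out = as.integer(serialize(payload, connection = NULL)))

# Reverse the two steps to get the original list back
restored <- unserialize(as.raw(container$out))
identical(restored, payload)
```

Every byte becomes one row of the container, so this approach is best suited to modestly sized objects.
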
Chapter 10: Hashmaps 


Section 10.1: Environments as hash maps 


Note: in the subsequent passages, the terms hash map and hash table are used interchangeably and refer to the same 
concept, namely, a data structure providing efficient key lookup through use of an internal hash function. 


Introduction 


Although R does not provide a native hash table structure, similar functionality can be achieved by leveraging the 
fact that the environment object returned from new.env (by default) provides hashed key lookups. The following 
two statements are equivalent, as the hash parameter defaults to TRUE: 


H <- new.env(hash = TRUE) 
H <- new.env() 


Additionally, one may specify that the internal hash table is pre-allocated with a particular size via the size 
parameter, which has a default value of 29. Like all other R objects, environments manage their own memory and 
will grow in capacity as needed, so while it is not necessary to request a non-default value for size, there may be a 
slight performance advantage in doing so if the object will (eventually) contain a very large number of elements. It 
is worth noting that allocating extra space via size does not, in itself, result in an object with a larger memory 
footprint: 


object.size(new.env()) 
# 56 bytes 


object.size(new.env(size = 10e4)) 
# 56 bytes 


Insertion 


Insertion of elements may be done using either of the [[<- or $<- methods provided for the environment class, but 
not by using "single bracket" assignment ([<-): 


H <- new.env() 


H[["key"]] <- rnorm(1) 


key2 <- "xyz" 
H[[key2]] <- data.frame(x = 1:3, y = letters[1:3]) 


H$another_key <- matrix(rbinom(9, 1, 0.5) > 0, nrow = 3) 
H["error"] <- 42 
#Error in H["error"] <- 42 : 
#  object of type 'environment' is not subsettable 


Like other facets of R, the first method (object[[key]] <- value) is generally preferred to the second (object$key 
<- value) because in the former case, a variable may be used instead of a literal value (e.g. key2 in the example 
above). 


As is generally the case with hash map implementations, the environment object will not store duplicate keys. 
Attempting to insert a key-value pair for an existing key will replace the previously stored value: 


Goalkicker.com - R Notes for Professionals 29 


H[["key3"]] <- "original value" 
H[["key3"]] <- "new value" 


H[["key3"]] 
#[1] "new value" 


Key Lookup 


Likewise, elements may be accessed with [[ or $, but not with [: 


H[["key"]] 
#[1] 1.630631 

H[[key2]]  ## assuming key2 <- "xyz" 
#   x y 
# 1 1 a 
# 2 2 b 
# 3 3 c 


H$another_key 
#       [,1]  [,2]  [,3] 
# [1,]  TRUE  TRUE  TRUE 
# [2,] FALSE FALSE FALSE 
# [3,]  TRUE  TRUE  TRUE 


H[1] 
#Error in H[1] : object of type 'environment' is not subsettable 
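
Looking up a key that was never inserted does not raise an error: [[ simply returns NULL. To distinguish "absent" from "present but NULL", base R offers exists() and get0(). A minimal sketch, with made-up keys:

```r
H <- new.env()
H[["key"]] <- 1

absent_value <- H[["missing"]]                          # NULL, no error

# inherits = FALSE restricts the search to H itself, ignoring enclosing environments
has_key     <- exists("key", envir = H, inherits = FALSE)
has_missing <- exists("missing", envir = H, inherits = FALSE)

# get0() returns a fallback value instead of failing
fallback <- get0("missing", envir = H, inherits = FALSE, ifnotfound = NA)
```

Setting inherits = FALSE matters here, because by default both functions would also find objects of the same name in parent environments.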


Inspecting the Hash Map 


Being just an ordinary environment, the hash map can be inspected by typical means: 


names(H) 
#[1] "another_key" "xyz"         "key"         "key3" 
ls(H) 
#[1] "another_key" "key"         "key3"        "xyz" 
str(H) 
#<environment: 0x7828228> 


ls.str(H) 
# another_key :  logi [1:3, 1:3] TRUE FALSE TRUE TRUE FALSE TRUE ... 
# key :  num 1.63 
# key3 :  chr "new value" 
# xyz : 'data.frame':    3 obs. of  2 variables: 
#  $ x: int  1 2 3 
#  $ y: chr  "a" "b" "c" 


Elements can be removed using rm: 


rm(list = c("key", "key3"), envir = H) 


ls.str(H) 
# another_key :  logi [1:3, 1:3] TRUE FALSE TRUE TRUE FALSE TRUE ... 
# xyz : 'data.frame':    3 obs. of  2 variables: 
#  $ x: int  1 2 3 
#  $ y: chr  "a" "b" "c" 


Flexibility 


One of the major benefits of using environment objects as hash tables is their ability to store virtually any type of 
object as a value, even other environments: 


H2 <- new.env() 


H2[["a"]] <- LETTERS 
H2[["b"]] <- as.list(x = 1:5, y = matrix(rnorm(10), 2)) 
H2[["c"]] <- head(mtcars, 3) 
H2[["d"]] <- Sys.Date() 
H2[["e"]] <- Sys.time() 
H2[["f"]] <- (function() { 
    H3 <- new.env() 
    for (i in seq_along(names(H2))) { 
        H3[[names(H2)[i]]] <- H2[[names(H2)[i]]] 
    } 
    H3 
})() 


ls.str(H2) 
# a :  chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" ... 
# b : List of 5 
#  $ : int 1 
#  $ : int 2 
#  $ : int 3 
#  $ : int 4 
#  $ : int 5 
# c : 'data.frame':    3 obs. of  11 variables: 
#  $ mpg : num  21 21 22.8 
#  $ cyl : num  6 6 4 
#  $ disp: num  160 160 108 
#  $ hp  : num  110 110 93 
#  $ drat: num  3.9 3.9 3.85 
#  $ wt  : num  2.62 2.88 2.32 
#  $ qsec: num  16.5 17 18.6 
#  $ vs  : num  0 0 1 
#  $ am  : num  1 1 1 
#  $ gear: num  4 4 4 
#  $ carb: num  4 4 1 
# d :  Date[1:1], format: "2016-08-03" 
# e :  POSIXct[1:1], format: "2016-08-03 19:25:14" 
# f : <environment: 0x91a7cb8> 


ls.str(H2$f) 
# a :  chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" ... 
# b : List of 5 
#  $ : int 1 
#  $ : int 2 
#  $ : int 3 
#  $ : int 4 
#  $ : int 5 
# c : 'data.frame':    3 obs. of  11 variables: 
#  $ mpg : num  21 21 22.8 
#  $ cyl : num  6 6 4 
#  $ disp: num  160 160 108 
#  $ hp  : num  110 110 93 
#  $ drat: num  3.9 3.9 3.85 
#  $ wt  : num  2.62 2.88 2.32 
#  $ qsec: num  16.5 17 18.6 
#  $ vs  : num  0 0 1 
#  $ am  : num  1 1 1 
#  $ gear: num  4 4 4 
#  $ carb: num  4 4 1 
# d :  Date[1:1], format: "2016-08-03" 
# e :  POSIXct[1:1], format: "2016-08-03 19:25:14" 


Limitations 


One of the major limitations of using environment objects as hash maps is that, unlike many aspects of R, 
vectorization is not supported for element lookup / insertion: 


names(H2) 
#[1] "a" "b" "c" "d" "e" "f" 


H2[[c("a", "b")]] 
#Error in H2[[c("a", "b")]] : 
#  wrong arguments for subsetting an environment 


Keys <- c("a", "b") 
H2[[Keys]] 
#Error in H2[[Keys]] : wrong arguments for subsetting an environment 
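
For lookups of several keys at once, base R's mget() accepts a character vector of names and returns a named list. A minimal sketch on a fresh environment; the environment E and its contents are made up for the example.

```r
E <- new.env()
E[["a"]] <- 1
E[["b"]] <- "two"

# One call retrieves both values as a named list
vals <- mget(c("a", "b"), envir = E)
vals
```

The result is an ordinary list, so the usual list tools (lapply, unlist and so on) can be applied to it afterwards.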


Depending on the nature of the data being stored in the object, it may be possible to use vapply or list2env for 
assigning many elements at once: 


E1 <- new.env() 
invisible({ 
    vapply(letters, function(x) { 
        E1[[x]] <- rnorm(1) 
        logical(0) 
    }, FUN.VALUE = logical(0)) 
}) 


all.equal(sort(names(E1)), letters) 
#[1] TRUE 


Keys <- letters 
E2 <- list2env( 
    setNames( 
        as.list(rnorm(26)), 
        nm = Keys), 
    envir = NULL, 
    hash = TRUE 
) 


all.equal(sort(names(E2)), letters) 
#[1] TRUE 


Neither of the above are particularly concise, but may be preferable to using a for loop, etc. when the number of 
key-value pairs is large. 


Section 10.2: package:hash 


The hash package offers a hash structure in R. However, in terms of timing for both inserts and reads it compares 
unfavorably to using environments as a hash. This documentation simply acknowledges its existence and provides 
sample timing code below for the above stated reasons. There is no identified case where hash is an appropriate 
solution in R code today. 


Consider: 


# Generic unique string generator 
unique_strings <- function(n) { 
string_i <- 1 
string_len <- 1 
ans <- character(n) 
chars <- c(letters, LETTERS) 
new_strings <- function(len, pfx) { 
for(i in 1:length(chars) ) { 
if (len == 1){ 
ans[string_i] <<- paste(pfx,chars[i],sep='') 
string_i <<- string_i + 1 


} else { 
new_strings(len-1,pfx=paste(pfx,chars[i], sep='')) 
} 
if (string_i > n) return () 

} 

} 


while(string_i <= n){ 
new_strings(string_len, '') 
string_len <- string_len + 1 
} 

sample(ans) 


} 


# Generate timings using an environment 
timingsEnv <- plyr::adply(2^(10:15), .mar=1, .fun=function(i) { 
    strings <- unique_strings(i) 
    ht1 <- new.env(hash=TRUE) 
    lapply(strings, function(s){ ht1[[s]] <<- 0L}) 
    data.frame( 
        size=c(i,i), 
        seconds=c( 
            system.time(for (j in 1:i) ht1[[strings[j]]]==0L)[3]), 
        type = c('1_hashedEnv') 
    ) 
}) 


# Generate timings using the hash package 
timingsHash <- plyr::adply(2^(10:15), .mar=1, .fun=function(i) { 
    strings <- unique_strings(i) 
    ht <- hash::hash() 
    lapply(strings, function(s) ht[[s]] <<- 0L) 
    data.frame( 
        size=c(i,i), 
        seconds=c( 
            system.time(for (j in 1:i) ht[[strings[j]]]==0L)[3]), 
        type = c('3_stringHash') 
    ) 
}) 


Section 10.3: package:listenv 


Although package:listenv implements a list-like interface to environments, its performance relative to 
environments for hash-like purposes is poor on hash retrieval. If the indexes are numeric, however, retrieval can be 
quite fast. The package also has other advantages, e.g. compatibility with package:future. Covering this 
package for that purpose goes beyond the scope of the current topic, but the timing code provided here can 
be used in conjunction with the example for package:hash for write timings. 
timingsListEnv <- plyr::adply(2^(10:15), .mar=1, .fun=function(i) { 
    strings <- unique_strings(i) 
    le <- listenv::listenv() 
    lapply(strings, function(s) le[[s]] <<- 0L) 
    data.frame( 
        size=c(i,i), 
        seconds=c( 
            system.time(for (k in 1:i) le[[k]]==0L)[3]), 
        type = c('2_numericListEnv') 
    ) 
}) 
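
The timing idea used in this chapter can also be tried without plyr, hash or listenv. The sketch below is not the benchmark above; it is a simplified, base-R-only variant that times key lookups in a hashed environment against a named list, with a made-up key set of 2000 entries. No particular result is implied, since timings depend on the machine and R version.

```r
keys <- paste0("key", seq_len(2000))   # hypothetical key set

# Hashed environment filled key by key
env <- new.env(hash = TRUE)
for (k in keys) env[[k]] <- 0L

# Named list with the same keys and values
lst <- setNames(as.list(rep(0L, length(keys))), keys)

# Elapsed seconds for looking up every key once
t_env <- system.time(for (k in keys) env[[k]])[["elapsed"]]
t_lst <- system.time(for (k in keys) lst[[k]])[["elapsed"]]
c(environment = t_env, list = t_lst)
```

Increasing the number of keys makes the difference between the two structures easier to observe.
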
Chapter 11: Creating vectors 


Section 11.1: Vectors from built-in constants: Sequences of letters & month names 

R has a number of built-in constants. The following constants are available: 


• LETTERS: the 26 upper-case letters of the Roman alphabet 
• letters: the 26 lower-case letters of the Roman alphabet 
• month.abb: the three-letter abbreviations for the English month names 
• month.name: the English names for the months of the year 
• pi: the ratio of the circumference of a circle to its diameter 


From the letters and month constants, vectors can be created. 

1) Sequences of letters: 


> letters 
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" 


> LETTERS[7:9] 
[1] "G" "H" "I" 


> letters[c(1,5,3,2,4)] 
[1] "a" "e" "c" "b" "d" 


2) Sequences of month abbreviations or month names: 


> month.abb 
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec" 


> month.name[1:4] 
[1] "January"  "February" "March"    "April" 


> month.abb[c(3,6,9,12)] 
[1] "Mar" "Jun" "Sep" "Dec" 


Section 11.2: Creating named vectors 

Named vectors can be created in several ways. With c: 

xc <- c('a' = 5, 'b' = 6, 'c' = 7, 'd' = 8) 


which results in: 


> xc 
a b c d 
5 6 7 8 


with list: 

xl <- list('a' = 5, 'b' = 6, 'c' = 7, 'd' = 8) 

which results in: 


> xl 
$a 
[1] 5 

$b 
[1] 6 

$c 
[1] 7 

$d 
[1] 8 


With the setNames function, two vectors of the same length can be used to create a named vector: 


x <- 5:8 
y <- letters[1:4] 

xy <- setNames(x, y) 


which results in a named integer vector: 


> xy 
a b c d 
5 6 7 8 


As can be seen, this gives the same result as the c method. 
You may also use the names function to get the same result: 


xy <- 5:8 
names(xy) <- letters[1:4] 


With such a vector it is also possible to select elements by name: 


> xy["c"] 
c 
7 


This feature makes it possible to use such a named vector as a look-up vector/table to match the values to values of 
another vector or column in a dataframe. Consider the following dataframe: 


mydf <- data.frame(let = c('c','a','b','d')) 


> mydf 
  let 
1   c 
2   a 
3   b 
4   d 


Suppose you want to create a new variable in the mydf dataframe called num with the correct values from xy in the 
rows. Using the match function the appropriate values from xy can be selected: 


mydf$num <- xy[match(mydf$let, names(xy))] 

which results in: 


> mydf 
  let num 
1   c   7 
2   a   5 
3   b   6 
4   d   8 
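
The same look-up can also be sketched by indexing the named vector directly with the keys, without match. The as.character call is a precaution for R versions in which data.frame turns strings into factors; unname drops the names carried over from xy.

```r
xy   <- setNames(5:8, letters[1:4])
mydf <- data.frame(let = c("c", "a", "b", "d"))

# Index the look-up vector by name, one key per row
mydf$num <- unname(xy[as.character(mydf$let)])
mydf
#   let num
# 1   c   7
# 2   a   5
# 3   b   6
# 4   d   8
```

Keys that are not present in the look-up vector yield NA in the new column.
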
Section 11.3: Sequence of numbers 


Use the : operator to create sequences of numbers, such as for use in vectorizing larger chunks of your code: 

x <- 1:5 
x 
## [1] 1 2 3 4 5 


This works both ways 


10:4 
# [1] 10  9  8  7  6  5  4 


and even with floating point numbers 


1.25:5 
# [1] 1.25 2.25 3.25 4.25 


or negatives 


-4:4 
#[1] -4 -3 -2 -1  0  1  2  3  4 

Section 11.4: seq() 

seq is a more flexible function than the : operator, allowing you to specify steps other than 1. 
The function creates a sequence from the start (default is 1) to the end, including that number. 


You can supply only the end (to) parameter 


seq(5) 
# [1] 1 2 3 4 5 


As well as the start 


seq(2, 5) # or seq(from=2, to=5) 
# [1] 2 3 4 5 


And finally the step (by) 


seq(2, 5, 0.5) # or seq(from=2, to=5, by=0.5) 
# [1] 2.0 2.5 3.0 3.5 4.0 4.5 5.0 


seq can optionally infer the (evenly spaced) steps when the desired length of the output (length.out) 
is supplied instead 


seq(2,5, length.out = 10) 
# [1] 2.000000 2.333333 2.666667 3.000000 3.333333 3.666667 4.000000 4.333333 4.666667 5.000000 


If the sequence needs to have the same length as another vector we can use along.with as a shorthand for 
length.out = length(x) 


x = 1:8 
seq(2,5,along.with = x) 
# [1] 2.000000 2.428571 2.857143 3.285714 3.714286 4.142857 4.571429 5.000000 


There are three useful simplified functions in the seq family: seq_along, seq_len, and seq.int. The seq_along and 
seq_len functions construct the natural (counting) numbers from 1 through N, where N is determined by the 
function argument: the length of a vector or list with seq_along, and the integer argument with seq_len. 


seq_along(x) 
# [1] 1 2 3 4 5 6 7 8 


Note that seq_along returns the indices of an existing object. 


# counting numbers 1 through 10 
seq_len(10) 
# [1]  1  2  3  4  5  6  7  8  9 10 
# indices of existing vector (or list) with seq_along 
letters[1:10] 
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" 
seq_along(letters[1:10]) 
# [1]  1  2  3  4  5  6  7  8  9 10 
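
A practical reason to prefer seq_along and seq_len over 1:length(x) is their behaviour on empty input. A minimal sketch:

```r
x <- character(0)          # an empty vector

backwards <- 1:length(x)   # 1:0 counts down: 1 0
empty     <- seq_along(x)  # integer(0), so a loop over it runs zero times

backwards
empty
```

A for loop over 1:length(x) would therefore run twice on an empty vector, while a loop over seq_along(x) is skipped.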


seq.int is the same as seq, maintained for ancient compatibility. 
There is also an old function sequence that creates a vector of sequences from a non-negative argument. 


sequence(4) 
# [1] 1 2 3 4 
sequence(c(3, 2)) 
# [1] 1 2 3 1 2 
sequence(c(3, 2, 5)) 
# [1] 1 2 3 1 2 1 2 3 4 5 


Section 11.5: Vectors 


Vectors in R can have different types (e.g. integer, logical, character). The most general way of defining a vector is by 
using the function vector(). 


vector('integer',2) # creates a vector of integers of size 2. 
vector('character',2) # creates a vector of characters of size 2. 
vector('logical',2) # creates a vector of logicals of size 2. 


However, in R, the shorthand functions are generally more popular. 


integer(2) # is the same as vector('integer',2) and creates an integer vector with two elements 
character(2) # is the same as vector('character',2) and creates a character vector with two elements 
logical(2) # is the same as vector('logical',2) and creates a logical vector with two elements 


Creating vectors with values, other than the default values, is also possible. Often the function c() is used for this. 
The c is short for combine or concatenate. 


c(1, 2) # creates a numeric vector of two elements: 1 and 2. 
c('a', 'b') # creates a character vector of two elements: a and b. 
c(T,F) # creates a logical vector of two elements: TRUE and FALSE. 


Important to note here is that R interprets any single value as a vector of size one. The literal 1 is stored as a 
numeric (double) vector, while 1L is an integer vector. The same holds for other numerics (e.g. 1.1), logicals 
(e.g. T or F), or characters (e.g. 'a'). Therefore, you are in essence combining vectors, which in turn are vectors. 


Take care to combine vectors of the same type. Otherwise, R will convert them to vectors of a common type. 


c(1,1.1,'a',T) # all types (numeric, character and logical) are converted to the 'lowest' 
               # type, which is character. 
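
The conversion order can be observed directly with class(). The following sketch combines values of different types pairwise:

```r
class(c(1L, 2L))     # "integer"
class(c(1, 2))       # "numeric"
class(c(TRUE, 1L))   # "integer": logical is converted to integer
class(c(1L, 1.5))    # "numeric": integer is converted to double
class(c(1.5, "a"))   # "character": everything is converted to character

mixed <- c(1, 1.1, "a", TRUE)
mixed
```

In the last vector every element, including TRUE, ends up as a string.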


Finding elements in vectors can be done with the [ operator. 
vec_int <- c(1,2,3) 
vec_char <- c('a','b','c') 


vec_int[2] # accessing the second element will return 2 
vec_char[2] # accessing the second element will return 'b' 


This can also be used to change values 


vec_int[2] <- 5 # change the second value from 2 to 5 
vec_int # returns [1] 1 5 3 


Finally, the : operator (short for the function seq()) can be used to quickly create a vector of numbers. 


vec_int <- 1:10 
vec_int # returns [1] 1 2 3 4 5 6 7 8 9 10 


This can also be used to subset vectors (from easy to more complex subsets) 


vec_char <- c('a','b','c','d','e') 
vec_char[2:4] # returns [1] "b" "c" "d" 
vec_char[c(1,3,5)] # returns [1] "a" "c" "e" 


Section 11.6: Expanding a vector with the rep() function 

The rep function can be used to repeat a vector in a fairly flexible manner. 


# repeat counting numbers, 1 through 5 twice 
rep(1:5, 2) 
[1] 1 2 3 4 5 1 2 3 4 5 


# repeat vector with incomplete recycling 
rep(1:5, 2, length.out=7) 
[1] 1 2 3 4 5 1 2 


The each argument is especially useful for expanding a vector of statistics of observational/experimental units into 
a vector of data.frame with repeated observations of these units. 


# same except repeat each integer next to each other 
rep(1:5, each=2) 
[1] 1 1 2 2 3 3 4 4 5 5 
A nice feature of rep regarding expansion to such a data structure is that expansion of a vector to an 
unbalanced panel can be accomplished by replacing the length argument with a vector that dictates the number of 
times to repeat each element in the vector: 


# automated length repetition 
rep(1:5, 1:5) 
[1] 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 
# hand-fed repetition length vector 
rep(1:5, c(1,1,1,2,2)) 
[1] 1 2 3 4 4 5 5 


This should expose the possibility of allowing an external function to feed the second argument of rep in order to 
dynamically construct a vector that expands according to the data. 


As with seq, faster, simplified versions of rep are rep_len and rep.int. These drop some attributes that rep 
maintains and so may be most useful in situations where speed is a concern and additional aspects of the repeated 
vector are unnecessary. 


# repeat counting numbers, 1 through 5 twice 
rep.int(1:5, 2) 
[1] 1 2 3 4 5 1 2 3 4 5 


# repeat vector with incomplete recycling 
rep_len(1:5, length.out=7) 
[1] 1 2 3 4 5 1 2 
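
The unbalanced-panel expansion described in this section can be sketched as follows. The unit labels and observation counts are made up; sequence() is used to number the observations within each unit.

```r
units <- c("A", "B", "C")
n_obs <- c(2, 1, 3)   # hypothetical number of observations per unit

panel <- data.frame(
  unit = rep(units, times = n_obs),  # each label repeated n_obs times
  obs  = sequence(n_obs)             # 1..n within each unit
)
panel
```

In practice the counts would come from the data, for example from table() or from a column of group sizes, rather than being typed in by hand.
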
Chapter 12: Date and Time 


R comes with classes for dates, date-times and time differences; see ?Dates, ?DateTimeClasses, ?difftime and 
follow the "See Also" section of those docs for further documentation. Related Docs: Dates and Date-Time Classes. 


Section 12.1: Current Date and Time 

R is able to access the current date, time and time zone:

Sys.Date() # Returns date as a Date object
## [1] "2016-07-21"

Sys.time() # Returns date & time at current locale as a POSIXct object
## [1] "2016-07-21 10:04:39 CDT"

as.numeric(Sys.time()) # Seconds from UNIX Epoch (1970-01-01 00:00:00 UTC)
## [1] 1469113479

Sys.timezone() # Time zone at current location
## [1] "Australia/Melbourne"

Use OlsonNames() to view the time zone names in Olson/IANA database on the current system: 


str(OlsonNames())
## chr [1:589] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa" "Africa/Algiers"
"Africa/Asmara" "Africa/Asmera" "Africa/Bamako"


Section 12.2: Go to the End of the Month 


Let's say we want to go to the last day of the month. The following function does this:


eom <- function(x, p=as.POSIXlt(x)) as.Date(modifyList(p, list(mon=p$mon + 1, mday=0)))
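
The function relies on the date fields being normalized on conversion: day 0 of the following month is read as the last day of the current month. A minimal sketch of that step in isolation, assuming a fixed UTC time zone so the result does not depend on the local settings (the variable names are illustrative):

```r
# Assumption: UTC is used so the example is independent of the local time zone
p <- as.POSIXlt("2000-02-10", tz = "UTC")
p$mon  <- p$mon + 1   # move to the following month (March)
p$mday <- 0           # day 0 of March is read as the last day of February
last_day <- as.Date(p)
last_day
```

Because 2000 is a leap year, this should give the 29th of February.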


Test: 


> x <- seq(as.POSIXct("2000-12-10"), as.POSIXct("2001-05-10"), by="months")
> data.frame(before=x, after=eom(x))
      before      after
1 2000-12-10 2000-12-31
2 2001-01-10 2001-01-31
3 2001-02-10 2001-02-28
4 2001-03-10 2001-03-31
5 2001-04-10 2001-04-30
6 2001-05-10 2001-05-31


Using a date in a string format:


> eom('2000-01-01')
[1] "2000-01-31"


Section 12.3: Go to First Day of the Month

Let's say we want to go to the first day of a given month: 


date <- as.Date("2017-01-20")


> as.POSIXlt(cut(date, "month"))
[1] "2017-01-01 EST"
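
If a Date rather than a POSIXlt result is wanted, one possible alternative (a sketch, not the approach used above) is to rebuild the date from its year and month with the day fixed to 01:

```r
date <- as.Date("2017-01-20")

# Format the year and month, hard-code the day as "01", and parse the result again
first_day <- as.Date(format(date, "%Y-%m-01"))
first_day
# [1] "2017-01-01"
```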


Section 12.4: Move a date a number of months consistently by months


Let's say we want to move a given date by a number (num) of months. We can define the following function, which uses
the mondate package:


library(mondate) # provides mondate()

moveNumOfMonths <- function(date, num) {
    as.Date(mondate(date) + num)
}


It moves the month part of the date consistently, adjusting the day when the date refers to the last day of the
month.


For example: 


Back one month:


> moveNumOfMonths("2017-10-30", -1)
[1] "2017-09-30"


Back two months:


> moveNumOfMonths("2017-10-30", -2)
[1] "2017-08-30"


Forward two months:


> moveNumOfMonths("2017-02-28", 2)
[1] "2017-04-30"


It moves two months from the last day of February, therefore the result is the last day of April.
Let's see how it works for backward and forward operations when it is the last day of the month:


> moveNumOfMonths("2016-11-30", 2)
[1] "2017-01-31"
> moveNumOfMonths("2017-01-31", -2)
[1] "2016-11-30"


Because November has 30 days, we get the same date in the backward operation, but: 


> moveNumOfMonths("2017-01-30", -2) 
[1] "2016-11-30" 
> moveNumOfMonths( "2016-11-30", 2) 
[1] "2017-01-31"


Because January has 31 days, moving two months forward from the last day of November gives the last day of January.




Chapter 13: The Date class 


Section 13.1: Formatting Dates 


To format Dates we use the format(date, format="%Y-%m-%d") function. It works on Date objects as well as on
POSIXct (given from as.POSIXct()) or POSIXlt (given from as.POSIXlt()) objects.


d = as.Date("2016-07-21") # Current Date Time Stamp

format(d, "%a") # Abbreviated Weekday
## [1] "Thu"

format(d, "%A") # Full Weekday
## [1] "Thursday"

format(d, "%b") # Abbreviated Month
## [1] "Jul"

format(d, "%B") # Full Month
## [1] "July"

format(d, "%m") # 01-12 Month Format
## [1] "07"

format(d, "%d") # 01-31 Day Format
## [1] "21"

format(d, "%e") # 1-31 Day Format
## [1] "21"

format(d, "%y") # 00-99 Year
## [1] "16"

format(d, "%Y") # Year with Century
## [1] "2016"


For more, see ?strptime. 


Section 13.2: Parsing Strings into Date Objects 


R contains a Date class, which is created with as.Date(), which takes a string or vector of strings, and if the date is
not in ISO 8601 date format YYYY-MM-DD, a formatting string of strptime-style tokens.


as.Date('2016-08-01') # in ISO format, so does not require formatting string 
## [1] "2016-08-01" 


as.Date('05/23/16', format = '%m/%d/%y')
## [1] "2016-05-23"


as.Date('March 23rd, 2016', '%B %drd, %Y') # add separators and literals to format
## [1] "2016-03-23"


as.Date(' 2016-08-01  foo') # leading whitespace and all trailing characters are ignored
## [1] "2016-08-01"


as.Date(c('2016-01-01', '2016-01-02'))
# [1] "2016-01-01" "2016-01-02"


Section 13.3: Dates


To coerce a variable to a date use the as.Date() function.


> x <- as.Date("2016-8-23")
> x
[1] "2016-08-23"
> class(x)
[1] "Date"


The as.Date() function allows you to provide a format argument. The default is %Y-%m-%d, which is Year-month-day.


> as.Date("23-8-2016", format="%d-%m-%Y") # To read in a European-style date
[1] "2016-08-23"


The format string can be placed either within a pair of single quotes or double quotes. Dates are usually expressed
in a variety of forms such as: "d-m-yy" or "d-m-YYYY" or "m-d-yy" or "m-d-YYYY" or "YYYY-m-d" or "YYYY-d-m".
These formats can also be expressed by replacing "-" by "/". Further, dates are also expressed in forms such as
"Nov 6, 1986" or "November 6, 1986" or "6 Nov, 1986" or "6 November, 1986" and so on. The as.Date() function
accepts all such character strings, and when we supply the appropriate format of the string, it always outputs the
date in the form "YYYY-MM-DD".


Suppose we have a date string "9-6-1962" in the format "%d-%m-%Y".


#
# It tries to interpret the string as YYYY-m-d
#
> as.Date("9-6-1962")
[1] "0009-06-19" # interprets as "%Y-%m-%d"
> as.Date("9/6/1962")
[1] "0009-06-19" # again interprets as "%Y-%m-%d"
#
# It has no problem in understanding, if the date is in form YYYY-m-d or YYYY/m/d
#
> as.Date("1962-6-9")
[1] "1962-06-09" # no problem
> as.Date("1962/6/9")
[1] "1962-06-09" # no problem


By specifying the correct format of the input string, we can get the desired results. We use the following codes for 
specifying the formats to the as.Date() function. 


Format Code   Meaning
%d            day
%m            month
%y            year in 2-digits
%Y            year in 4-digits
%b            abbreviated month in 3 chars
%B            full name of the month

Consider the following example specifying the format parameter:


> as.Date("9-6-1962", format="%d-%m-%Y")
[1] "1962-06-09"


The parameter name format can be omitted.


> as.Date("9-6-1962", "%d-%m-%Y")
[1] "1962-06-09"


Sometimes month names abbreviated to their first three characters are used when writing dates, in which case
we use the format specifier %b.


> as.Date("6Nov1962", "%d%b%Y")
[1] "1962-11-06"


Note that there are no '-', '/' or white spaces between the members of this date string. The format string
should exactly match the input string. Consider the following example:


> as.Date("6 Nov, 1962", "%d %b, %Y")
[1] "1962-11-06"


Note that there is a comma in the date string and hence a comma in the format specification too. If the comma is
omitted in the format string, the result is NA. An example usage of the %B format specifier is as follows:


> as.Date("October 12, 2016", "%B %d, %Y")
[1] "2016-10-12"
> as.Date("12 October, 2016", "%d %B, %Y")
[1] "2016-10-12"


The %y format is system-specific and hence should be used with caution. Other parameters used with this function are
origin and tz (time zone).


Chapter 14: Date-time classes (POSIXct and POSIXlt)


R includes two date-time classes -- POSIXct and POSIXlt -- see ?DateTimeClasses.


Section 14.1: Formatting and printing date-time objects 


# test date-time object
options(digits.secs = 3)
d = as.POSIXct("2016-08-30 14:18:30.58", tz = "UTC")


format(d,"%S") # 00-61 Second as integer
## [1] "30"


format(d,"%OS") # 00-60.99... Second as fractional
## [1] "30.579"


format(d,"%M") # 00-59 Minute
## [1] "18"


format(d,"%H") # 00-23 Hours
## [1] "14"


format(d,"%I") # 01-12 Hours
## [1] "02"


format(d,"%p") # AM/PM Indicator
## [1] "PM"


format(d,"%z") # Signed offset
## [1] "+0000"


format(d,"%Z") # Time Zone Abbreviation
## [1] "UTC"


See ?strptime for details on the format strings here, as well as other formats. 


Section 14.2: Date-time arithmetic 
To add/subtract time, use POSIXct, since it stores times in seconds 


## adding/subtracting times - 60 seconds
as.POSIXct("2016-01-01") + 60
# [1] "2016-01-01 00:01:00 AEDT"


## adding 3 hours, 14 minutes, 15 seconds
as.POSIXct("2016-01-01") + ( (3 * 60 * 60) + (14 * 60) + 15)
# [1] "2016-01-01 03:14:15 AEDT"


More formally, as.difftime can be used to specify time periods to add to a date or datetime object. E.g.: 


as.POSIXct("2016-01-01") +
    as.difftime(3, units="hours") +
    as.difftime(14, units="mins") +
    as.difftime(15, units="secs")
# [1] "2016-01-01 03:14:15 AEDT"


To find the difference between dates/times use difftime() for differences in seconds, minutes, hours, days or
weeks.


# using POSIXct objects
difftime(
    as.POSIXct("2016-01-01 12:00:00"),
    as.POSIXct("2016-01-01 11:59:59"),
    units = "secs"
)
# Time difference of 1 secs


To generate sequences of date-times use seq.POSIXt() or simply seq. 
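
As a short sketch of such a sequence, the example below builds four date-times spaced six hours apart. The start value and the variable names are arbitrary choices for illustration, and UTC is assumed so that daylight saving changes play no role:

```r
# Assumption: UTC is used so the spacing is not affected by daylight saving time
start <- as.POSIXct("2016-01-01 00:00:00", tz = "UTC")

# seq() dispatches to seq.POSIXt() because start is a POSIXct object
six_hourly <- seq(start, by = "6 hours", length.out = 4)
six_hourly
```

The by argument also accepts a number of seconds or a difftime object, and to can be given instead of length.out.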


Section 14.3: Parsing strings into date-time objects 


The functions for parsing a string into POSIXct and POSIXlt take similar parameters and return a similar-looking
result, but there are differences in how that date-time is stored; see "Remarks."


as.POSIXct("11:38",            # time string
           format = "%H:%M")   # formatting string
## [1] "2016-07-21 11:38:00 CDT"

strptime("11:38",              # identical, but makes a POSIXlt object
         format = "%H:%M")
## [1] "2016-07-21 11:38:00 CDT"

as.POSIXct("11 AM",
           format = "%I %p")
## [1] "2016-07-21 11:00:00 CDT"


Note that date and timezone are imputed. 


as.POSIXct("11:38:22", # time string without timezone 
format = "%H:%M:%S", 
tz = "America/New_York") # set time zone 


## [1] "2016-07-21 11:38:22 EDT" 


as.POSIXct("2016-07-21 00:00:00", 
format = "%F %T") # shortcut tokens for "%Y-%m-%d" and "%H:%M:%S" 


See ?strptime for details on the format strings here. 


Notes 
Missing elements 


• If a date element is not supplied, then that from the current date is used.
• If a time element is not supplied, then that from midnight is used, i.e. 0s.
• If no timezone is supplied in either the string or the tz parameter, the local timezone is used.


Time zones


• The accepted values of tz depend on the location.
    ◦ CST is given with "CST6CDT" or "America/Chicago"
• For supported locations and time zones use:
    ◦ In R: OlsonNames()
    ◦ Alternatively, try in R: system("cat $R_HOME/share/zoneinfo/zone.tab")
• These locations are given by Internet Assigned Numbers Authority (IANA)
    ◦ List of tz database time zones (Wikipedia)


Chapter 15: The character class


Characters are what other languages call 'string vectors.'


Section 15.1: Coercion 


To check whether a value is a character use the is.character() function. To coerce a variable to a character use 
the as.character() function. 


x <- "The quick brown fox jumps over the lazy dog" 
class(x) 

[1] "character" 

is.character(x) 

[1] TRUE 


Note that numerics can be coerced to characters, but attempting to coerce a character to numeric may result in NA. 


as.numeric("2")
[1] 2
as.numeric("fox")
[1] NA
Warning message:
NAs introduced by coercion


Chapter 16: Numeric classes and storage modes


Section 16.1: Numeric 


Numeric represents integers and doubles and is the default mode assigned to vectors of numbers. The function
is.numeric() will evaluate whether a vector is numeric. It is important to note that although integers and doubles
will pass is.numeric(), the function as.numeric() will always attempt to convert to type double.


x <- 12.3
y <- 12L


# confirm types
typeof(x)
[1] "double"
typeof(y)
[1] "integer"


# confirm both numeric
is.numeric(x)
[1] TRUE
is.numeric(y)
[1] TRUE


# logical to numeric
as.numeric(TRUE)
[1] 1


# While TRUE == 1, it is a double and not an integer
is.integer(as.numeric(TRUE))
[1] FALSE


Doubles are R's default numeric value. They are double precision vectors, meaning that they take up 8 bytes of
memory for each value in the vector. R has no single precision data type and so all real numbers are stored in the
double precision format.


is.double(1)
TRUE
is.double(1.0)
TRUE
is.double(1L)
FALSE


Integers are whole numbers that can be written without a fractional component. Integers are represented by a 
number with an L after it. Any number without an L after it will be considered a double. 


typeof(1)
[1] "double"
class(1)
[1] "numeric"
typeof(1L)
[1] "integer"
class(1L)
[1] "integer"


Though in most cases using an integer or double will not matter, sometimes replacing doubles with integers will
consume less memory and operational time. A double vector uses 8 bytes per element while an integer vector uses
only 4 bytes per element. As the size of vectors increases, using proper types can dramatically speed up processes.
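
The memory claim can be checked directly with object.size(). This is a small sketch; the vector length of one million and the variable names are arbitrary choices:

```r
n <- 1e6
int_vec <- integer(n)  # one million integer zeros, 4 bytes each
dbl_vec <- numeric(n)  # one million double zeros, 8 bytes each

object.size(int_vec)   # roughly 4 MB plus a small fixed header
object.size(dbl_vec)   # roughly 8 MB plus a small fixed header
```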


# test speed on lots of arithmetic
library(microbenchmark)

microbenchmark(
    for( i in 1:100000) {
        2L * i
        10L + i
    },

    for( i in 1:100000) {
        2.0 * i
        10.0 + i
    }
)


Unit: milliseconds
                                  expr      min       lq     mean   median       uq      max neval
 for (i in 1:1e+05) { 2L * i 10L + i } 40.74775 42.34747 50.70543 42.99120 65.46864 94.11804   100
 for (i in 1:1e+05) { 2 * i 10 + i }   41.07807 42.38358 53.52588 44.26364 65.84971 83.00456   100


Chapter 17: The logical class


Logical is a mode (and an implicit class) for vectors. 


Section 17.1: Logical operators 


There are two sorts of logical operators: those that accept and return vectors of any length (elementwise operators:
!, |, &, xor()) and those that only evaluate the first element in each argument (&&, ||). The second sort is primarily
used as the cond argument to the if function. (Since R 4.3.0, passing a vector longer than one element to && or || is
an error.)


Logical Operator   Meaning                                   Syntax
!                  Not                                       !x
&                  element-wise (vectorized) and             x & y
&&                 and (single element only)                 x && y
|                  element-wise (vectorized) or              x | y
||                 or (single element only)                  x || y
xor                element-wise (vectorized) exclusive OR    xor(x,y)


Note that the || operator evaluates the left condition and if the left condition is TRUE the right side is never
evaluated. This can save time if the first is the result of a complex operation. The && operator will likewise return
FALSE without evaluation of the second argument when the first element of the first argument is FALSE.


> x <- 5 

> x > 6 || stop("X is too small") 
Error: X is too small 

> x > 3 || stop("X is too small") 
[1] TRUE 
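
The same short-circuit behaviour makes && useful for guarding a test that would otherwise fail. The helper below is a hypothetical example written for this note (it is not a base R function):

```r
# Hypothetical helper: TRUE only for a single positive number
is_positive_scalar <- function(x) {
  # each check is evaluated only if every check to its left was TRUE
  is.numeric(x) && length(x) == 1 && x > 0
}

is_positive_scalar(NULL)  # FALSE; x > 0 is never evaluated
is_positive_scalar(5)     # TRUE
is_positive_scalar(-1)    # FALSE
```

Without the first two checks, x > 0 on NULL would return a zero-length logical, which if() cannot use as a condition.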


To check whether a value is a logical you can use the is.logical() function.


Section 17.2: Coercion 


To coerce a variable to a logical use the as.logical() function.


> x <- 2
> z <- x > 4
> z
[1] FALSE


> class(x) 

[1] "numeric" 

> as.logical(2) 
[1] TRUE 


When applying as.numeric() to a logical, a double will be returned. NA is a logical value and a logical operator with 
an NA will return NA if the outcome is ambiguous. 
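
A brief sketch of both points, using an arbitrary example vector. Because TRUE is treated as 1 and FALSE as 0, sum() and mean() on a logical vector give counts and proportions:

```r
x <- c(1, 5, 3, 8)
big <- x > 2             # logical vector: FALSE TRUE TRUE TRUE

typeof(as.numeric(big))  # "double"
n_big <- sum(big)        # 3: number of elements greater than 2
share_big <- mean(big)   # 0.75: proportion of elements greater than 2

NA & FALSE               # FALSE: the result does not depend on the NA
NA & TRUE                # NA: the result is ambiguous
```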


Section 17.3: Interpretation of NAs 


See Missing values for details.


> TRUE & NA
[1] NA
> FALSE & NA
[1] FALSE
> TRUE || NA
[1] TRUE
> FALSE || NA
[1] NA


Chapter 18: Data frames


Section 18.1: Create an empty data.frame 


A data.frame is a special kind of list: it is rectangular. Each element (column) of the list has the same length, and
each row has a "row name". Each column has its own class, but the class of one column can be different from the
class of another column (unlike a matrix, where all elements must have the same class).


In principle, a data.frame could have no rows and no columns:


> structure(list(character()), class = "data.frame")
NULL
<0 rows> (or 0-length row.names)


But this is unusual. It is more common for a data.frame to have many columns and many rows. Here is a
data.frame with three rows and two columns (a is integer class and b is character class):


> structure(list(a = 1:3, b = letters[1:3]), class = "data.frame")
[1] a b
<0 rows> (or 0-length row.names)


In order for the data.frame to print, we will need to supply some row names. Here we use just the numbers 1:3: 


> structure(list(a = 1:3, b = letters[1:3]), class = "data.frame", row.names = 1:3) 


Now it becomes obvious that we have a data.frame with 3 rows and 2 columns. You can check this using nrow(),
ncol(), and dim():


> x <- structure(list(a = numeric(3), b = character(3)), class = "data.frame", row.names = 1:3)
> nrow(x)
[1] 3
> ncol(x)
[1] 2
> dim(x)
[1] 3 2


R provides two other functions (besides structure()) that can be used to create a data.frame. The first is called,
intuitively, data.frame(). It checks to make sure that the column names you supplied are valid, that the list
elements are all the same length, and supplies some automatically generated row names. This means that the
output of data.frame() might not always be exactly what you expect:


> str(data.frame("a a a" = numeric(3), "b-b-b" = character(3)))
'data.frame': 3 obs. of 2 variables:
 $ a.a.a: num 0 0 0
 $ b.b.b: Factor w/ 1 level "": 1 1 1


The other function is called as.data.frame(). This can be used to coerce an object that is not a data.frame into
being a data.frame by running it through data.frame(). As an example, consider a matrix:


> m <- matrix(letters[1:9], nrow = 3)
> m
     [,1] [,2] [,3]
[1,] "a"  "d"  "g"
[2,] "b"  "e"  "h"
[3,] "c"  "f"  "i"


And the result:


> as.data.frame(m)
  V1 V2 V3
1  a  d  g
2  b  e  h
3  c  f  i
> str(as.data.frame(m))
'data.frame': 3 obs. of 3 variables:
 $ V1: Factor w/ 3 levels "a","b","c": 1 2 3
 $ V2: Factor w/ 3 levels "d","e","f": 1 2 3
 $ V3: Factor w/ 3 levels "g","h","i": 1 2 3


Section 18.2: Subsetting rows and columns from a data frame 
Syntax for accessing rows and columns: [, [[, and $ 
This topic covers the most common syntax to access specific rows and columns of a data frame. These are


• Like a matrix with single brackets data[rows, columns]
    ◦ Using row and column numbers
    ◦ Using column (and row) names
• Like a list:
    ◦ With single brackets data[columns] to get a data frame
    ◦ With double brackets data[[one_column]] to get a vector
• With $ for a single column data$column_name


We will use the built-in mtcars data frame to illustrate. 


Like a matrix: data[rows, columns] 
With numeric indexes 


Using the built in data frame mtcars, we can extract rows and columns using [] brackets with a comma included. 
Indices before the comma are rows: 


# get the first row 
mtcars[1, ] 

# get the first five rows 
mtcars[1:5, ] 


Similarly, after the comma are columns: 


# get the first column
mtcars[, 1]
# get the first, third and fifth columns:
mtcars[, c(1, 3, 5)]


As shown above, if either rows or columns are left blank, all will be selected. mtcars[1, ] indicates the first row
with all the columns.


With column (and row) names


So far, this is identical to how rows and columns of matrices are accessed. With data.frames, most of the time it is
preferable to use a column name rather than a column index. This is done by using a character with the column name
instead of numeric with a column number:


# get the mpg column
mtcars[, "mpg"]
# get the mpg, cyl, and disp columns
mtcars[, c("mpg", "cyl", "disp")]


Though less common, row names can also be used: 


mtcars["Mazda RX4", ]


Rows and columns together 
The row and column arguments can be used together: 


# first four rows of the mpg column 
mtcars[1:4, "mpg"] 


# 2nd and 5th row of the mpg, cyl, and disp columns
mtcars[c(2, 5), c("mpg", "cyl", "disp")]


A warning about dimensions: 


When using these methods, if you extract multiple columns, you will get a data frame back. However, if you extract 
a single column, you will get a vector, not a data frame under the default options. 


## multiple columns returns a data frame
class(mtcars[, c("mpg", "cyl")])
# [1] "data.frame"
## single column returns a vector
class(mtcars[, "mpg"])
# [1] "numeric"


There are two ways around this. One is to treat the data frame as a list (see below), the other is to add a drop =
FALSE argument. This tells R to not "drop the unused dimensions":


class(mtcars[, "mpg", drop = FALSE]) 
# [1] "data.frame" 


Note that matrices work the same way - by default a single column or row will be a vector, but if you specify drop = 
FALSE you can keep it as a one-column or one-row matrix. 
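
A minimal sketch of the matrix case, using a small made-up matrix; dim() shows whether the dimensions were dropped:

```r
m <- matrix(1:6, nrow = 2)

dropped <- m[, 1]                # plain integer vector, dimensions dropped
kept    <- m[, 1, drop = FALSE]  # still a matrix with 2 rows and 1 column

dim(dropped)  # NULL
dim(kept)     # 2 1
```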


Like a list 


Data frames are essentially lists, i.e., they are a list of column vectors (that all must have the same length). Lists 
can be subset using single brackets [ for a sub-list, or double brackets [[ for a single element. 


With single brackets data[columns] 


When you use single brackets and no commas, you will get columns back as a data frame, because data frames are lists of columns.


mtcars["mpg"]
mtcars[c("mpg", "cyl", "disp")]
my_columns <- c("mpg", "cyl", "hp")
mtcars[my_columns]


Single brackets like a list vs. single brackets like a matrix


The difference between data[columns] and data[, columns] is that when treating the data.frame as a list (no
comma in the brackets) the object returned will be a data.frame. If you use a comma to treat the data.frame like a
matrix then selecting a single column will return a vector but selecting multiple columns will return a data.frame.


## When selecting a single column 

## like a list will return a data frame 
class(mtcars["mpg"]) 

# [1] "data.frame" 

## like a matrix will return a vector 
class(mtcars[, "mpg"]) 

# [1] "numeric" 


With double brackets data[[one_column]]


To extract a single column as a vector when treating your data.frame as a list, you can use double brackets [[.
This will only work for a single column at a time.


# extract a single column by name as a vector 
mtcars[["mpg"]] 


# extract a single column by name as a data frame (as above) 
mtcars["mpg"] 


Using $ to access columns 


A single column can be extracted using the magical shortcut $ without using a quoted column name: 


# get the column "mpg"
mtcars$mpg


Columns accessed by $ will always be vectors, not data frames. 
Drawbacks of $ for accessing columns 


The $ can be a convenient shortcut, especially if you are working in an environment (such as RStudio) that will
auto-complete the column name in this case. However, $ has drawbacks as well: it uses non-standard evaluation to avoid
the need for quotes, which means it will not work if your column name is stored in a variable.


my_column <- "mpg"
# the below will not work
mtcars$my_column
# but these will work
mtcars[, my_column] # vector
mtcars[my_column] # one-column data frame
mtcars[[my_column]] # vector


Due to these concerns, $ is best used in interactive R sessions when your column names are constant. For 
programmatic use, for example in writing a generalizable function that will be used on different data sets with 
different column names, $ should be avoided. 
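
As a sketch of such programmatic use, the hypothetical helper below (col_mean is a name invented for this example) takes the column name as a string and therefore uses [[ rather than $:

```r
# Hypothetical helper: mean of one column, chosen by name at run time
col_mean <- function(data, column) {
  mean(data[[column]])  # [[ accepts a name stored in a variable; $ would not
}

mpg_mean <- col_mean(mtcars, "mpg")
hp_mean  <- col_mean(mtcars, "hp")
```

The same function works unchanged on any data frame and any numeric column.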


Also note that $ uses partial matching by default when extracting from recursive objects (except
environments):


# gives you the values of the "mpg" column,
# as "mtcars" has only one column whose name starts with "m"
mtcars$m
# gives you NULL,
# as "mtcars" has more than one column whose name starts with "d"
mtcars$d


Advanced indexing: negative and logical indices 


Whenever we have the option to use numbers for an index, we can also use negative numbers to omit certain
indices or a boolean (logical) vector to indicate exactly which items to keep.


Negative indices omit elements 


mtcars[1, ] # first row 
mtcars[ -1, ] # everything but the first row 
mtcars[-(1:10), ] # everything except the first 10 rows 


Logical vectors indicate specific elements to keep 
We can use a condition such as < to generate a logical vector, and extract only the rows that meet the condition: 


# logical vector indicating TRUE when a row has mpg less than 15
# FALSE when a row has mpg >= 15
test <- mtcars$mpg < 15


# extract these rows from the data frame
mtcars[test, ]


We can also bypass the step of saving the intermediate variable:

# extract all columns for rows where the value of cyl is 4.
mtcars[mtcars$cyl == 4, ]


# extract the cyl, mpg, and hp columns where the value of cyl is 4
mtcars[mtcars$cyl == 4, c("cyl", "mpg", "hp")]


Section 18.3: Convenience functions to manipulate data.frames


Some convenience functions to manipulate data.frames are subset(), transform(), with() and within().
subset


The subset() function allows you to subset a data.frame in a more convenient way (subset() also works with other
classes):


subset(mtcars, subset = cyl == 6, select = c("mpg", "hp")) 


                mpg  hp
Mazda RX4      21.0 110
Mazda RX4 Wag  21.0 110
Hornet 4 Drive 21.4 110
Valiant        18.1 105
Merc 280       19.2 123
Merc 280C      17.8 123
Ferrari Dino   19.7 175


In the code above we are asking only for the rows in which cyl == 6 and for the columns mpg and hp. You could achieve
the same result using [ ] with the following code:


mtcars[mtcars$cyl == 6, c("mpg", "hp")]




transform 


The transform() function is a convenience function to change columns inside a data.frame. For instance the
following code adds another column named mpg2 with the result of mpg*2 to the mtcars data.frame:


mtcars <- transform(mtcars, mpg2 = mpg*2)


with and within 


Both with() and within() let you evaluate expressions inside the data.frame environment, allowing a
somewhat cleaner syntax, saving you the use of some $ or [].


For example, if you want to create, change and/or remove multiple columns in the airquality data.frame:


aq <- within(airquality, {
    lOzone <- log(Ozone)                  # creates new column
    Month <- factor(month.abb[Month])     # changes Month column
    cTemp <- round((Temp - 32) * 5/9, 1)  # creates new column
    S.cT <- Solar.R / cTemp               # creates new column
    rm(Day, Temp)                         # removes columns
})
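
While within() returns a modified copy of the data.frame, with() simply returns the value of the expression. A minimal sketch using mtcars (the variable name is an arbitrary choice):

```r
# Mean mpg of the four-cylinder cars, without repeating mtcars$ for each column
mean_mpg_4cyl <- with(mtcars, mean(mpg[cyl == 4]))
mean_mpg_4cyl
```

This is equivalent to mean(mtcars$mpg[mtcars$cyl == 4]).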


Section 18.4: Introduction 


Data frames are likely the data structure you will use most in your analyses. A data frame is a special kind of list
that stores same-length vectors of different classes. You create data frames using the data.frame function. The
example below shows this by combining a numeric and a character vector into a data frame. It uses the : operator,
which will create a vector containing all integers from 1 to 3.


df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
class(df1)
## [1] "data.frame"


Data frame objects do not print with quotation marks, so the class of the columns is not always obvious.


df2 <- data.frame(x = c("1", "2", "3"), y = c("a", "b", "c"))
df2
##   x y
## 1 1 a
## 2 2 b
## 3 3 c

Without further investigation, the "x" columns in df1 and df2 cannot be differentiated. The str function can be 
used to describe objects with more detail than class. 


str(df1)
## 'data.frame': 3 obs. of 2 variables:
##  $ x: int 1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
str(df2)
## 'data.frame': 3 obs. of 2 variables:
##  $ x: Factor w/ 3 levels "1","2","3": 1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3


Here you see that df1 is a data.frame and has 3 observations of 2 variables, "x" and "y." Then you are told that "x"
has the data type integer (not important for this class, but for our purposes it behaves like a numeric) and "y" is a
factor with three levels (another data class we are not discussing). It is important to note that, in R versions
before 4.0.0, data frames coerce characters to factors by default (since R 4.0.0 the default is stringsAsFactors =
FALSE). The default behavior can be changed with the stringsAsFactors parameter:


df3 <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE) 


str(df3)
## 'data.frame': 3 obs. of 2 variables:
##  $ x: int 1 2 3
##  $ y: chr "a" "b" "c"


Now the "y" column is a character. As mentioned above, each "column" of a data frame must have the same length.
Trying to create a data.frame from vectors with different lengths will result in an error. (Try running
data.frame(x = 1:3, y = 1:4) to see the resulting error.)
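
To look at that error without stopping a script, the call can be wrapped in tryCatch(). This is only a sketch; the variable name result is an arbitrary choice:

```r
# Capture the error message instead of halting execution
result <- tryCatch(
  data.frame(x = 1:3, y = 1:4),
  error = function(e) conditionMessage(e)
)
result
# [1] "arguments imply differing number of rows: 3, 4"
```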


As test-cases for data frames, some data is provided by R by default. One of them is iris, loaded as follows: 


mydataframe <- iris 
str(mydataframe) 


Section 18.5: Convert all columns of a data.frame to character class


A common task is to convert all columns of a data.frame to character class for ease of manipulation, such as in the
cases of sending data.frames to a RDBMS or merging data.frames containing factors where levels may differ
between input data.frames.


The best time to do this is when the data is read in - almost all input methods that create data frames have an
option stringsAsFactors which can be set to FALSE.
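
As a sketch of setting the option at read time, the example below parses a small inline CSV string with read.csv(). The data and the names csv_text and jobs_df are made up for illustration:

```r
# Inline CSV text stands in for a file on disk (illustrative data only)
csv_text <- "jobs,pay\nscientist,160000\nanalyst,100000"

# stringsAsFactors = FALSE keeps text columns as character rather than factor
jobs_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(jobs_df)
```

With this setting the jobs column is read as character regardless of the R version's default.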


If the data has already been created, factor columns can be converted to character columns as shown below. 


bob <- data.frame(jobs = c("scientist", "analyst"),
                  pay = c(160000, 100000), age = c(30, 25))
str(bob)
'data.frame': 2 obs. of 3 variables:
 $ jobs: Factor w/ 2 levels "analyst","scientist": 2 1
 $ pay : num 160000 100000
 $ age : num 30 25


# Convert *all columns* to character
bob[] <- lapply(bob, as.character)
str(bob)
'data.frame': 2 obs. of 3 variables:
 $ jobs: chr "scientist" "analyst"
 $ pay : chr "160000" "1e+05"
 $ age : chr "30" "25"


# Convert only factor columns to character
bob[] <- lapply(bob, function(x) {
    if (is.factor(x)) x <- as.character(x)
    return(x)
})


Chapter 19: Split function
Section 19.1: Using split in the split-apply-combine paradigm 


A popular form of data analysis is split-apply-combine, in which you split your data into groups, apply some sort of 
processing on each group, and then combine the results. 


Let's consider a data analysis where we want to obtain the two cars with the best miles per gallon (mpg) for each 
cylinder count (cyl) in the built-in mtcars dataset. First, we split the mtcars data frame by the cylinder count: 


(spl <- split(mtcars, mtcars$cyl))


# $`4`
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
# Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
# Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
# Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# ...
#
# $`6`
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# ...
#
# $`8`
#                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
# Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
# Merc 450SL        17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
# ...


This has returned a list of data frames, one for each cylinder count. As indicated by the output, we could obtain the
relevant data frames with spl$`4`, spl$`6`, and spl$`8` (some might find it more visually appealing to use
spl$"4" or spl[["4"]] instead).


Now, we can use lapply to loop through this list, applying our function that extracts the cars with the best 2 mpg 
values from each of the list elements: 


(best2 <- lapply(spl, function(x) tail(x[order(x$mpg),], 2)))


# $`4`
#                 mpg cyl disp hp drat    wt  qsec vs am gear carb
# Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
# Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
#
# $`6`
#                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#
# $`8`
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
# Pontiac Firebird  19.2   8  400 175 3.08 3.845 17.05  0  0    3    2


Finally, we can combine everything together using rbind. We want to call rbind(best2[["4"]], best2[["6"]],
best2[["8"]]), but this would be tedious if we had a huge list. As a result, we use:


do.call(rbind, best2) 


#                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# 4.Fiat 128          32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# 4.Toyota Corolla    33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# 6.Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# 6.Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# 8.Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# 8.Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2


This returns the result of rbind (argument 1, a function) with all the elements of best2 (argument 2, a list) passed as 
arguments. 


With simple analyses like this one, it can be more compact (and possibly much less readable!) to do the whole split- 
apply-combine in a single line of code: 


do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(x) tail(x[order(x$mpg),], 2)))


It is also worth noting that the lapply(split(x,f), FUN) combination can be alternatively framed using the ?by 
function: 


by(mtcars, mtcars$cyl, function(x) tail(x[order(x$mpg),], 2))
do.call(rbind, by(mtcars, mtcars$cyl, function(x) tail(x[order(x$mpg),], 2)))
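
If the same split-apply-combine pattern is needed for several grouping or ranking columns, it may be convenient to wrap the three steps in a small function. The helper below is a minimal sketch; the name top_n_by_group and its arguments are hypothetical and not part of base R.

```r
# Hypothetical helper: top `n` rows of `data` per level of `group`, ranked by `by`
top_n_by_group <- function(data, group, by, n = 2) {
  pieces <- split(data, data[[group]])        # split
  best <- lapply(pieces, function(x) {
    tail(x[order(x[[by]]), ], n)              # apply
  })
  do.call(rbind, best)                        # combine
}

res <- top_n_by_group(mtcars, group = "cyl", by = "mpg", n = 2)
res[, c("mpg", "cyl")]
```

Passing column names as strings and indexing with [[ keeps the helper free of non-standard evaluation, at the cost of slightly more typing at the call site.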


Section 19.2: Basic usage of split 


split divides a vector or a data.frame into buckets according to a factor or grouping variable. The buckets
are returned as a list, which can then be used for group-wise computation (with for loops
or lapply/sapply).


The first example shows the usage of split on a vector.


Consider the following vector of letters:


testdata <- c("e", "o", "r", "g", "a", "y", "w", "q", "i", "s", "b", "v", "x", "h", "u")


The objective is to separate those letters into vowels and consonants, i.e. to split the vector according to letter type.


Let's first create a grouping vector: 


vowels <- c('a', 'e', 'i', 'o', 'u', 'y')
letter_type <- ifelse(testdata %in% vowels, "vowels", "consonants")


Note that letter_type has the same length as our vector testdata. Now we can split this test data into the two
groups, vowels and consonants:


split(testdata, letter_type) 
#$consonants
#[1] "r" "g" "w" "q" "s" "b" "v" "x" "h"
#
#$vowels
#[1] "e" "o" "a" "y" "i" "u"
Hence, the result is a list whose names come from our grouping vector/factor letter_type.
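
Once a vector has been split, each bucket can be processed independently. The sketch below uses a letter vector like the one above (the exact letters are illustrative), applies two simple functions per group, and uses unsplit to show that the grouping can be reversed.

```r
testdata <- c("e", "o", "r", "g", "a", "y", "w", "q", "i", "s", "b", "v", "x", "h", "u")
vowels <- c("a", "e", "i", "o", "u", "y")
letter_type <- ifelse(testdata %in% vowels, "vowels", "consonants")

groups <- split(testdata, letter_type)

# Apply a function to each bucket
lengths(groups)                        # number of letters per group
sapply(groups, paste, collapse = "")   # one string per group

# unsplit() reverses the operation and restores the original order
identical(unsplit(groups, letter_type), testdata)
```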
split also has a method for data.frames.

Consider for instance iris data: 

data(iris) 

By using split, one can create a list containing one data.frame per iris species (variable: Species):


> liris <- split(iris, iris$Species)
> names(liris)
[1] "setosa"     "versicolor" "virginica"
> head(liris$setosa)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa


(This contains only the data for the setosa group.)

One example operation would be to compute the correlation matrix per iris species; one would then use lapply:

> (lcor <- lapply(liris, FUN=function(df) cor(df[,1:4])))
$setosa
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000   0.7425467    0.2671758   0.2780984
Sepal.Width     0.7425467   1.0000000    0.1777000   0.2327520
Petal.Length    0.2671758   0.1777000    1.0000000   0.3316300
Petal.Width     0.2780984   0.2327520    0.3316300   1.0000000

$versicolor
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000   0.5259107    0.7540490   0.5464611
Sepal.Width     0.5259107   1.0000000    0.5605221   0.6639987
Petal.Length    0.7540490   0.5605221    1.0000000   0.7866681
Petal.Width     0.5464611   0.6639987    0.7866681   1.0000000

$virginica
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000   0.4572278    0.8642247   0.2811077
Sepal.Width     0.4572278   1.0000000    0.4010446   0.5377280
Petal.Length    0.8642247   0.4010446    1.0000000   0.3221082
Petal.Width     0.2811077   0.5377280    0.3221082   1.0000000


Then we can retrieve, per group, the best pair of correlated variables (the correlation matrix is reshaped/melted,
the diagonal is filtered out, and the best record is selected):


> library(reshape)
> (topcor <- lapply(lcor, FUN=function(cormat) {
    correlations <- melt(cormat, variable_name="correlation");
    filtered <- correlations[correlations$X1 != correlations$X2, ];
    filtered[which.max(filtered$correlation), ]
  }))
$setosa
           X1           X2 correlation
2 Sepal.Width Sepal.Length   0.7425467

$versicolor
            X1           X2 correlation
12 Petal.Width Petal.Length   0.7866681

$virginica
             X1           X2 correlation
3 Petal.Length Sepal.Length   0.8642247


Note that once computations have been performed at the group level, one may be interested in stacking the results,
which can be done with:


> (result <- do.call("rbind", topcor))
                     X1           X2 correlation
setosa      Sepal.Width Sepal.Length   0.7425467
versicolor  Petal.Width Petal.Length   0.7866681
virginica  Petal.Length Sepal.Length   0.8642247
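
The same per-group search can also be done without the reshape package. The sketch below is one possible base R variant, not the approach used above: as.table followed by as.data.frame produces the long format, so the column names are Var1 and Var2 rather than X1 and X2. The helper name top_pair is hypothetical.

```r
liris <- split(iris, iris$Species)
lcor <- lapply(liris, function(df) cor(df[, 1:4]))

# Hypothetical helper: most strongly correlated pair of distinct variables
top_pair <- function(cormat) {
  # One row per matrix cell (long format)
  long <- as.data.frame(as.table(cormat), responseName = "correlation")
  # Drop the diagonal (each variable against itself)
  long <- long[as.character(long$Var1) != as.character(long$Var2), ]
  long[which.max(long$correlation), ]
}

result <- do.call(rbind, lapply(lcor, top_pair))
result
```

Because a correlation matrix is symmetric, every pair appears twice; which.max returns the first occurrence.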
Chapter 20: Reading and writing tabular
data in plain-text files (CSV, TSV, etc.)


Parameter     Details
file          name of the CSV file to read
header        logical: does the .csv file contain a header row with column names?
sep           character: symbol that separates the cells on each row
quote         character: symbol used to quote character strings
dec           character: symbol used as decimal separator
fill          logical: when TRUE, rows with unequal length are filled with blank fields.
comment.char  character: character used as comment in the csv file. Lines preceded by this character are ignored.
...           extra arguments to be passed to read.table


Section 20.1: Importing .csv files 
Importing using base R 


Comma separated value files (CSVs) can be imported using read.csv, which wraps read.table, but uses sep = ","
to set the delimiter to a comma.


# get the file path of a CSV included in R's utils package 
csv_path <- system.file("misc", "exDIF.csv", package = "utils") 


# path will vary based on installation location 
csv_path 
## [1] "/Library/Frameworks/R.framework/Resources/library/utils/misc/exDIF.csv" 


df <- read.csv(csv_path) 
df 


##    Var1 Var2
## 1  2.70    A
## 2  3.14    B
## 3 10.00    A
## 4 -7.00    A


A user-friendly option, file.choose, lets you browse through the directories interactively:


df <- read.csv(file.choose()) 


Notes 


• Unlike read.table, read.csv defaults to header = TRUE, and uses the first row as column names.

• In R versions before 4.0.0, all these functions convert strings to factor class by default unless either
as.is = TRUE or stringsAsFactors = FALSE is set. Since R 4.0.0, the default is stringsAsFactors = FALSE.

• The read.csv2 variant defaults to sep = ";" and dec = "," for use on data from countries where the
comma is used as a decimal point and the semicolon as a field separator.
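
The notes above can be tried without any files on disk, because read.csv and read.csv2 accept inline data through the text argument. The two strings below are made-up sample data that encode the same table in the two conventions.

```r
# Inline text stands in for files so the example is self-contained
us_csv <- "id,price,label\n1,2.5,a\n2,3.75,b"
eu_csv <- "id;price;label\n1;2,5;a\n2;3,75;b"

us <- read.csv(text = us_csv, stringsAsFactors = FALSE)
eu <- read.csv2(text = eu_csv, stringsAsFactors = FALSE)

str(us)
all.equal(us, eu)   # both spellings parse to the same data frame

# header = FALSE treats the first line as data and names the columns V1, V2, ...
names(read.csv(text = us_csv, header = FALSE))
```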


Importing using packages 


The readr package's read_csv function offers much faster performance, a progress bar for large files, and more
popular default options than standard read.csv, including never converting strings to factors.


library(readr)
df <- read_csv(csv_path)

df
## # A tibble: 4 x 2
##    Var1  Var2
##   <dbl> <chr>
## 1  2.70     A
## 2  3.14     B
## 3 10.00     A
## 4 -7.00     A


Section 20.2: Importing with data.table 


The data.table package introduces the function fread. While it is similar to read.table, fread is usually faster and
more flexible, guessing the file's delimiter automatically.


library(data.table)

# get the file path of a CSV included in R's utils package
csv_path <- system.file("misc", "exDIF.csv", package = "utils")


# path will vary based on R installation location 
csv_path 


## [1] "/Library/Frameworks/R.framework/Resources/library/utils/misc/exDIF.csv" 


dt <- fread(csv_path) 


dt 

## Var1 Var2 
## 1: 2.70 A 
## 2: 3.14 B 
## 3: 10.00 A 
## 4: -7.00 A 


Where the argument input is a string representing:


• the filename (e.g. "filename.csv"),
• a shell command that acts on a file (e.g. "grep 'word' filename"), or
• the input itself (e.g. "input1, input2 \n A, B \n C, D").


fread returns an object of class data.table that inherits from class data.frame, suitable for use with
data.table's usage of []. To return an ordinary data.frame, set the data.table parameter to FALSE:


df <- fread(csv_path, data.table = FALSE) 


class(df) 
## [1] "data.frame"


df

##    Var1 Var2
## 1  2.70    A
## 2  3.14    B
## 3 10.00    A
## 4 -7.00    A


Notes 


• fread does not have all the same options as read.table. One argument it has historically lacked is
comment.char, which may lead to unwanted behavior if the source file contains #.
• fread uses only " for the quote parameter.
• fread uses only a few (5) lines to guess variable types.


Section 20.3: Exporting .csv files 
Exporting using base R 
Data can be written to a CSV file using write.csv(): 


write.csv(mtcars, "mtcars.csv") 


Commonly-specified parameters include row.names = FALSE and na = "".
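
To see what these two parameters do, the following sketch writes a small slice of mtcars to a temporary file and reads it back. The NA is introduced only for illustration.

```r
path <- tempfile(fileext = ".csv")

df <- head(mtcars, 3)
df$mpg[2] <- NA   # artificial missing value

# row.names = FALSE drops the row-name column;
# na = "" writes missing values as empty cells instead of the string NA
write.csv(df, path, row.names = FALSE, na = "")

readLines(path, n = 3)   # inspect the raw text
back <- read.csv(path)
```

Note that with row.names = FALSE the car names are not written, so they cannot be recovered on import; copy them into a regular column first if they matter.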


Exporting using packages 
readr::write_csv is significantly faster than write.csv and does not write row names.


library(readr) 


write_csv(mtcars, "mtcars.csv") 


Section 20.4: Import multiple csv files 


# pattern is a regular expression: match names ending in .csv
files <- list.files(pattern = "\\.csv$")
data_list <- lapply(files, read.csv)


This reads every file and adds it to a list. Afterwards, if all data.frames have the same structure they can be combined
into one big data.frame:


df <- do.call(rbind, data_list) 
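
The following self-contained sketch runs the same steps against two small files created in a temporary directory. The file names and contents are made up for illustration; naming the list by file is an optional extra that makes each row traceable to its source.

```r
# Create two small CSV files in a scratch directory (illustrative data)
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

# full.names = TRUE returns paths that work from any working directory
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Name the list by file so the row names of the result record the source
data_list <- lapply(setNames(files, basename(files)), read.csv)

combined <- do.call(rbind, data_list)
combined
```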


Section 20.5: Importing fixed-width files 


Fixed-width files are text files in which columns are not separated by any character delimiter, like , or ;, but rather 
have a fixed character length (width). Data is usually padded with white spaces. 


An example: 

Column1 Column2   Column3           Column4Column5
1647    pi        'important'       3.141596.28318
1731    euler     'quite important' 2.718285.43656
1979    answer    'The Answer.'     42     42


Let's assume this data table exists in the local file constants.txt in the working directory.


Importing with base R
df <- read.fwf('constants.txt', widths = c(8,10,18,7,8), header = FALSE, skip = 1)


#>     V1     V2                 V3      V4      V5
#> 1 1647     pi        'important' 3.14159 6.28318
#> 2 1731  euler  'quite important' 2.71828 5.43656
#> 3 1979 answer      'The Answer.'      42 42.0000


Note:


• Column titles don't need to be separated by a character (Column4Column5)
• The widths parameter defines the width of each column
• Non-separated headers are not readable with read.fwf()
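
Since the header line cannot be parsed reliably, one common workaround is to skip it and pass the column names explicitly. The sketch below builds a comparable file in a temporary location; the column names and the unquoted sample values are assumptions chosen for illustration, and sprintf is used only to pad each field to its width.

```r
# Build a small fixed-width file; sprintf() pads each field to its column width
widths <- c(8, 10, 18, 7, 8)
fmt <- "%-8s%-10s%-18s%-7s%-8s"
lines <- c(
  sprintf(fmt, "Year", "Name", "Importance", "Value", "Doubled"),
  sprintf(fmt, "1647", "pi", "important", "3.14159", "6.28318"),
  sprintf(fmt, "1731", "euler", "quite important", "2.71828", "5.43656"),
  sprintf(fmt, "1979", "answer", "The Answer.", "42", "42")
)
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Skip the header line and supply the column names explicitly;
# strip.white and stringsAsFactors are passed on to read.table
df <- read.fwf(path, widths = widths, skip = 1,
               col.names = c("Year", "Name", "Importance", "Value", "Doubled"),
               strip.white = TRUE, stringsAsFactors = FALSE)
str(df)
```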


Importing with readr 
library(readr) 


df <- read_fwf('constants.txt',
               fwf_cols(Year = 8, Name = 10, Importance = 18, Value = 7, Doubled = 8),
               skip = 1)
df
#> # A tibble: 3 x 5
#>    Year   Name        Importance    Value  Doubled
#>   <int>  <chr>             <chr>    <dbl>    <dbl>
#> 1  1647     pi       'important'  3.14159  6.28318
#> 2  1731  euler 'quite important'  2.71828  5.43656
#> 3  1979 answer     'The Answer.' 42.00000 42.00000


Note:


• readr's fwf_* helper functions offer alternative ways of specifying column lengths, including automatic
guessing (fwf_empty)
• readr is faster than base R
• Column titles cannot be automatically imported from the data file
Chapter 21: Pipe operators (%>% and
others)


Parameter  Details
lhs        A value or the magrittr placeholder.
rhs        A function call using the magrittr semantics.


Pipe operators, available in magrittr, dplyr, and other R packages, process a data-object using a sequence of 
operations by passing the result of one step as input for the next step using infix-operators rather than the more 
typical R method of nested function calls. 


Note that the intended aim of pipe operators is to increase human readability of written code. See Remarks section 
for performance considerations. 


Section 21.1: Basic use and chaining 


The pipe operator, %>%, is used to insert an argument into a function. It is not a base feature of the language and 
can only be used after attaching a package that provides it, such as magrittr. The pipe operator takes the left-hand 
side (LHS) of the pipe and uses it as the first argument of the function on the right-hand side (RHS) of the pipe. For 
example: 


library(magrittr) 


1:10 %>% mean
# [1] 5.5


# is equivalent to
mean(1:10)
# [1] 5.5


The pipe can be used to replace a sequence of function calls. Multiple pipes allow us to read and write the 
sequence from left to right, rather than from inside to out. For example, suppose we have years defined as a factor 
but want to convert it to a numeric. To prevent possible information loss, we first convert to character and then to 
numeric: 


years <- factor(2008:2012) 


# nesting 
as.numeric(as.character(years))


# piping 
years %>% as.character %>% as.numeric 


If we don't want the LHS (Left Hand Side) used as the first argument on the RHS (Right Hand Side), there are 
workarounds, such as naming the arguments or using . to indicate where the piped input goes. 


# example with grepl 
# its syntax: 
# grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) 


# note that the `substring` result is the *2nd* argument of grepl
grepl("Wo", substring("Hello World", 7, 11)) 


# piping while naming other arguments 
"Hello World" %>% substring(7, 11) %>% grepl(pattern = "Wo") 
# piping with .
"Hello World" %>% substring(7, 11) %>% grepl("Wo", .)


# piping with . and curly braces 
"Hello World" %>% substring(7, 11) %>% { c(paste('Hi', .)) } 
#[1] "Hi World" 


#using LHS multiple times in argument with curly braces and . 
"Hello World" %>% substring(7, 11) %>% { c(paste(. ,'Hi', .)) } 
#[1] "World Hi World" 


Section 21.2: Functional sequences 


Given a sequence of steps we use repeatedly, it's often handy to store it in a function. Pipes allow for saving such 
functions in a readable format by starting a sequence with a dot as in: 


. %>% RHS 
As an example, suppose we have factor dates and want to extract the year: 


library(magrittr) # needed to include the pipe operators 
library(lubridate) 
read_year <- . %>% as.character %>% as.Date %>% year 


# Creating a dataset 

df <- data.frame(now = "2015-11-11", before = "2012-01-01") 
# now before 

# 1 2015-11-11 2012-01-01 


# Example 1: applying `read_year` to a single character-vector
df$now %>% read_year
# [1] 2015


# Example 2: applying `read_year` to all columns of `df`
df %>% lapply(read_year) %>% as.data.frame # implicit `lapply(df, read_year)`
#    now before
# 1 2015   2012


# Example 3: same as above using `mutate_all`
library(dplyr)
df %>% mutate_all(funs(read_year))
# if an older version of dplyr use `mutate_each`
# (note: funs() is deprecated as of dplyr 0.8.0)
#    now before
# 1 2015   2012


We can review the composition of the function by typing its name or using functions:


read_year
# Functional sequence with the following components:
#
#  1. as.character(.)
#  2. as.Date(.)
#  3. year(.)
#
# Use 'functions' to extract the individual functions.


We can also access each function by its position in the sequence:


read_year[[2]]
# function (.)
# as.Date(.)


Generally, this approach may be useful when clarity is more important than speed. 
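
To make clear what a functional sequence amounts to, here is a rough base R equivalent that needs neither magrittr nor lubridate. The helpers compose and year_of are hypothetical names introduced for this sketch; magrittr's own implementation differs.

```r
# Hypothetical helper: compose functions left to right, like a functional sequence
compose <- function(...) {
  fs <- list(...)
  function(x) Reduce(function(acc, f) f(acc), fs, x)
}

# Base R stand-in for lubridate::year()
year_of <- function(d) as.integer(format(d, "%Y"))

read_year_base <- compose(as.character, as.Date, year_of)

read_year_base(factor("2015-11-11"))
```

Reduce threads the input through each function in turn, which is the behaviour the dot-initiated pipe chain describes.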


Section 21.3: Assignment with %<>% 


The magrittr package contains a compound assignment infix-operator, %<>%, that updates a value by first piping it 
into one or more rhs expressions and then assigning the result. This eliminates the need to type an object name 
twice (once on each side of the assignment operator <-). %<>% must be the first infix-operator in a chain: 


library(magrittr) 
library(dplyr)


df <- mtcars 

Instead of writing 

df <- df %>% select(1:3) %>% filter(mpg > 20, cyl == 6) 
or 

df %>% select(1:3) %>% filter(mpg > 20, cyl == 6) -> df 
The compound assignment operator will both pipe and reassign df: 


df %<>% select(1:3) %>% filter(mpg > 20, cyl == 6) 


Section 21.4: Exposing contents with %$% 


The exposition pipe operator, %$%, exposes the column names as R symbols within the left-hand side object to the 
right-hand side expression. This operator is handy when piping into functions that do not have a data argument 
(unlike, say, lm) and that don't take a data.frame and column names as arguments (most of the main dplyr
functions). 


The exposition pipe operator %$% allows a user to avoid breaking a pipeline when needing to refer to column
names. For instance, say you want to filter a data.frame and then run a correlation test on two columns with
cor.test:


library(magrittr)
library(dplyr)
mtcars %>%
  filter(wt > 2) %$%
  cor.test(hp, mpg)

#>
#>  Pearson's product-moment correlation
#>
#> data:  hp and mpg
#> t = -5.9546, df = 26, p-value = 2.768e-06
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.8825498 -0.5393217
#> sample estimates:
#>        cor
#> -0.7595673


Here the standard %>% pipe passes the data.frame through to filter(), while the %$% pipe exposes the column 
names to cor.test(). 


The exposition pipe works like a pipe-able version of the base R with() function, and the same left-hand side
objects are accepted as inputs.
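
For comparison, the same analysis can be written in base R with subset and with, which makes the correspondence between %$% and with() concrete. The variable names heavy and res are arbitrary.

```r
# Base R equivalent of: mtcars %>% filter(wt > 2) %$% cor.test(hp, mpg)
heavy <- subset(mtcars, wt > 2)
res <- with(heavy, cor.test(hp, mpg))

res$estimate    # the correlation coefficient
res$parameter   # degrees of freedom
```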


Section 21.5: Creating side effects with %T>% 


Some functions in R produce a side effect (i.e. saving, printing, plotting, etc) and do not always return a meaningful 
or desired value. 


%T>% (tee operator) allows you to forward a value into a side-effect-producing function while keeping the original
lhs value intact. In other words: the tee operator works like %>%, except the return value is lhs itself, and not the
result of the rhs function/expression.


Example: Create, pipe, write, and return an object. If %>% were used in place of %T>% in this example, then the 
variable all_letters would contain NULL rather than the value of the sorted object. 


all_letters <- c(letters, LETTERS) %>% 
sort %T>% 
write.csv(file = "all_letters.csv") 


read.csv("all_letters.csv") %>% head() 


#   X x
# 1 1 a
# 2 2 A
# 3 3 b
# 4 4 B
# 5 5 c
# 6 6 C
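
The semantics of the tee operator can be mimicked with a small base R function. The helper tee below is a hypothetical name used only in this sketch: it calls a function for its side effect and hands back the original value.

```r
# Hypothetical helper mimicking %T>%: call `f` for its side effect, return `x` unchanged
tee <- function(x, f, ...) {
  f(x, ...)
  invisible(x)
}

path <- tempfile(fileext = ".csv")
all_letters <- tee(sort(c(letters, LETTERS)), write.csv, file = path)

length(all_letters)   # the sorted vector, not write.csv()'s NULL return value
file.exists(path)
```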


Warning: Piping an unnamed object to save() will produce an object named . when loaded into the workspace 
with load(). However, a workaround using a helper function is possible (which can also be written inline as an 
anonymous function). 


all_letters <- c(letters, LETTERS) %>% 
sort %T>% 
save(file = "all_letters.RData") 


load("all_letters.RData", e <- new.env()) 


get("all_letters", envir = e)
# Error in get("all_letters", envir = e) : object 'all_letters' not found

get(".", envir = e)
#  [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F" "g" "G" "h" "H" "i" "I" "j" "J"
# [21] "k" "K" "l" "L" "m" "M" "n" "N" "o" "O" "p" "P" "q" "Q" "r" "R" "s" "S" "t" "T"
# [41] "u" "U" "v" "V" "w" "W" "x" "X" "y" "Y" "z" "Z"


# Work-around
save2 <- function(. = ., name, file = stop("'file' must be specified")) {
  assign(name, .)
  call_save <- call("save", ... = name, file = file)
  eval(call_save)
}

all_letters <- c(letters, LETTERS) %>%
  sort %T>%
  save2("all_letters", "all_letters.RData")


Section 21.6: Using the pipe with dplyr and ggplot2 


The %>% operator can also be used to pipe the dplyr output into ggplot. This creates a unified exploratory data 
analysis (EDA) pipeline that is easily customizable. This method is faster than doing the aggregations internally in 
ggplot and has the added benefit of avoiding unnecessary intermediate variables. 


library(dplyr)
library(ggplot2)


diamonds %>%
  filter(depth > 60) %>%
  group_by(cut) %>%
  summarize(mean_price = mean(price)) %>%
  ggplot(aes(x = cut, y = mean_price)) +
  geom_bar(stat = "identity")
Chapter 22: Linear Models (Regression)


Parameter    Meaning
formula      a formula in Wilkinson-Rogers notation; response ~ ... where ... contains terms corresponding to
             variables in the environment or in the data frame specified by the data argument
data         data frame containing the response and predictor variables
subset       a vector specifying a subset of observations to be used: may be expressed as a logical statement in
             terms of the variables in data
weights      analytical weights (see Section 22.3: Weighting)
na.action    how to handle missing (NA) values: see ?na.action
method       how to perform the fitting. Only choices are "qr" or "model.frame" (the latter returns the model frame
             without fitting the model, identical to specifying model=TRUE)
model        whether to store the model frame in the fitted object
x            whether to store the model matrix in the fitted object
y            whether to store the model response in the fitted object
qr           whether to store the QR decomposition in the fitted object
singular.ok  whether to allow singular fits, models with collinear predictors (a subset of the coefficients will
             automatically be set to NA in this case)
contrasts    a list of contrasts to be used for particular factors in the model; see the contrasts.arg argument of
             ?model.matrix.default. Contrasts can also be set with options() (see the contrasts argument) or by
             assigning the contrast attributes of a factor (see ?contrasts)
offset       used to specify an a priori known component in the model. May also be specified as part of the
             formula. See ?model.offset
...          additional arguments to be passed to lower-level fitting functions (lm.fit() or lm.wfit())


Section 22.1: Linear regression on the mtcars dataset 


The built-in mtcars data frame contains information about 32 cars, including their weight, fuel efficiency (in miles- 
per-gallon), speed, etc. (To find out more about the dataset, use help(mtcars)). 


If we are interested in the relationship between fuel efficiency (mpg) and weight (wt) we may start plotting those 
variables with: 


plot(mpg ~ wt, data = mtcars, col=2) 


The plot shows a (roughly linear) relationship. If we then want to perform linear regression to determine the
coefficients of a linear model, we would use the lm function:


fit <- lm(mpg ~ wt, data = mtcars)


The ~ here means "explained by", so the formula mpg ~ wt means we are predicting mpg as explained by wt. The 
most helpful way to view the output is with: 


summary(fit)


Which gives the output:


Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10


This provides information about: 


• The estimated value of each coefficient (the y-intercept and the slope for wt), which suggests the best-fit
prediction of mpg is 37.2851 + (-5.3445) * wt

• The p-value of each coefficient, which suggests that the intercept and weight effects are probably not due to chance

• Overall estimates of fit such as R^2 and adjusted R^2, which show how much of the variation in mpg is
explained by the model
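
The quantities listed above do not have to be read off the printed summary; they can be extracted from the fitted object directly, as this short sketch shows.

```r
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

coef(fit)                     # named vector: (Intercept), wt
coef(s)                       # matrix of estimates, std. errors, t and p values
coef(s)["wt", "Pr(>|t|)"]     # p-value for the wt slope
s$r.squared                   # R^2
s$adj.r.squared               # adjusted R^2
confint(fit)                  # 95% confidence intervals for the coefficients
```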


We could add a line to our first plot to show the predicted mpg:


abline(fit, col=3, lwd=2)


It is also possible to add the equation to that plot. First, get the coefficients with coef. Then, using paste0, we
collapse the coefficients with the appropriate variables and +/- to build the equation. Finally, we add it to the plot using
mtext:


bs <- round(coef(fit), 3)
lmlab <- paste0("mpg = ", bs[1],
                ifelse(sign(bs[2])==1, " + ", " - "), abs(bs[2]), " wt ")
mtext(lmlab, 3, line=-2)


The result is: 
mpg = 37.285 - 5.344 wt


Section 22.2: Using the 'predict' function


Once a model is built, predict is the main function for testing it with new data. Our example will use the mtcars built-in
dataset to regress miles per gallon against displacement:


my_mdl <- lm(mpg ~ disp, data=mtcars)
my_mdl


Call:
lm(formula = mpg ~ disp, data = mtcars)

Coefficients:
(Intercept)         disp
   29.59985     -0.04122


If I had a new data source with displacement, I could see the estimated miles per gallon.


set.seed(1234) 

newdata <- sample(mtcars$disp, 5)
newdata
[1] 258.0  71.1  75.7 145.0 400.0


newdf <- data.frame(disp=newdata)
predict(my_mdl, newdf)
       1        2        3        4        5
18.96635 26.66946 26.47987 23.62366 13.11381


The most important part of the process is to create a new data frame with the same column names as the original
data. In this case, the original data had a column labeled disp, so I was sure to give the new data that same name.
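
For a model with one predictor, predict simply evaluates the fitted line at the new values. The sketch below checks this by hand and also shows the interval argument; the displacement values 100, 200 and 300 are arbitrary illustrative inputs.

```r
my_mdl <- lm(mpg ~ disp, data = mtcars)
newdf <- data.frame(disp = c(100, 200, 300))   # illustrative displacement values

# predict() evaluates intercept + slope * disp for each new row
by_hand <- coef(my_mdl)[["(Intercept)"]] + coef(my_mdl)[["disp"]] * newdf$disp
all.equal(unname(predict(my_mdl, newdf)), by_hand)

# interval = "prediction" adds lower and upper bounds for a new observation
predict(my_mdl, newdf, interval = "prediction")
```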
Caution


Let's look at a few common pitfalls:


1. not using a data.frame in the new object:


predict(my_mdl, newdata)
Error in eval(predvars, data, env) :
  numeric 'envir' arg not of length one


2. not using same names in new data frame:


newdf2 <- data.frame(newdata)
predict(my_mdl, newdf2)
Error in eval(expr, envir, enclos) : object 'disp' not found


Accuracy


To check the accuracy of the prediction you will need the actual y values of the new data. In this example, newdf will
need a column for 'mpg' and 'disp'.


newdf <- data.frame(mpg=mtcars$mpg[1:10], disp=mtcars$disp[1:10])
#     mpg  disp
# 1  21.0 160.0
# 2  21.0 160.0
# 3  22.8 108.0
# 4  21.4 258.0
# 5  18.7 360.0
# 6  18.1 225.0
# 7  14.3 360.0
# 8  24.4 146.7
# 9  22.8 140.8
# 10 19.2 167.6
p <- predict(my_mdl, newdf)


# root mean square error
sqrt(mean((p - newdf$mpg)^2, na.rm=TRUE))


Section 22.3: Weighting


Sometimes we want the model to give more weight to some data points or examples than others. This is possible
by specifying the weight for the input data while learning the model. There are generally two kinds of scenarios
where we might use non-uniform weights over the examples:


• Analytic Weights: Reflect the different levels of precision of different observations. For example, if analyzing
data where each observation is the average result from a geographic area, the analytic weight is
proportional to the inverse of the estimated variance. Useful when dealing with averages in data by providing
a proportional weight given the number of observations.

• Sampling Weights (Inverse Probability Weights - IPW): a statistical technique for calculating statistics
standardized to a population different from that in which the data was collected. Study designs with a
disparate sampling population and population of target inference (target population) are common in
application. Useful when dealing with data that have missing values.


The lm() function does analytic weighting. For sampling weights the survey package is used to build a survey
design object and run svyglm(). By default, the survey package uses sampling weights. (NOTE: lm() and svyglm()
with family gaussian() will produce the same point estimates, because they both solve for the coefficients by
minimizing the weighted least squares. They differ in how standard errors are calculated.)
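
The claim that lm() with weights minimizes the weighted least squares criterion can be checked against the closed-form solution of the weighted normal equations, (X'WX) b = X'Wy. The sketch below uses mtcars with made-up weights, which are an assumption for illustration only.

```r
# Illustrative weights (an assumption): heavier cars count more
w <- mtcars$wt / mean(mtcars$wt)

fit_w <- lm(mpg ~ disp + hp, data = mtcars, weights = w)

# Closed-form weighted least squares: solve (X'WX) b = X'Wy
X <- model.matrix(~ disp + hp, data = mtcars)
y <- mtcars$mpg
beta <- solve(t(X) %*% (w * X), t(X) %*% (w * y))

# Coefficients agree up to numerical tolerance
all.equal(unname(coef(fit_w)), as.vector(beta), tolerance = 1e-6)
```

Multiplying X by the vector w scales each row by its weight, which is equivalent to forming W %*% X with a diagonal W but avoids building the full matrix.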


Test Data 


data <- structure(list(lexptot = c(9.1595012302023, 9.86330744180814,
8.92372556833205, 8.58202430280175, 10.1133857229336), progvillm = c(1L,
1L, 1L, 1L, 0L), sexhead = c(1L, 1L, 0L, 1L, 1L), agehead = c(79L,
43L, 52L, 48L, 35L), weight = c(1.04273509979248, 1.01139605045319,
1.01139605045319, 1.01139605045319, 0.76305216550827)), .Names = c("lexptot",
"progvillm", "sexhead", "agehead", "weight"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))


Analytic Weights 


lm.analytic <- lm(lexptot ~ progvillm + sexhead + agehead,
                  data = data, weights = weight)
summary(lm.analytic)


Output 


Call:
lm(formula = lexptot ~ progvillm + sexhead + agehead, data = data,
    weights = weight)

Weighted Residuals:
         1          2          3          4          5
 9.249e-02  5.823e-01  0.000e+00 -6.762e-01 -1.527e-16

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.016054   1.744293   5.742    0.110
progvillm   -0.781204   1.344974  -0.581    0.665
sexhead      0.306742   1.040625   0.295    0.818
agehead     -0.005983   0.032024  -0.187    0.882

Residual standard error: 0.8971 on 1 degrees of freedom
Multiple R-squared:  0.467, Adjusted R-squared:  -1.132
F-statistic: 0.2921 on 3 and 1 DF,  p-value: 0.8386


Sampling Weights (IPW) 


library(survey) 
data$X <- 1:nrow(data) # Create unique id


# Build survey design object with unique id, ipw, and data.frame 
des1 <- svydesign(id = ~X, weights = ~weight, data = data) 


# Run glm with survey design object 
prog.lm <- svyglm(lexptot ~ progvillm + sexhead + agehead, design=des1)
summary(prog.lm)


Output 


Call:
svyglm(formula = lexptot ~ progvillm + sexhead + agehead, design = des1)

Survey design:
svydesign(id = ~X, weights = ~weight, data = data)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.016054   0.183942  54.452   0.0117 *
progvillm   -0.781204   0.640372  -1.220   0.4371
sexhead      0.306742   0.397089   0.772   0.5813
agehead     -0.005983   0.014747  -0.406   0.7546
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.2078647)

Number of Fisher Scoring iterations: 2


Section 22.4: Checking for nonlinearity with polynomial 
regression 


Sometimes when working with linear regression we need to check for non-linearity in the data. One way to do this
is to fit a polynomial model and check whether it fits the data better than a linear model. There may also be other
reasons, such as theoretical ones, to fit a quadratic or higher-order model because the relationship between the
variables is believed to be inherently polynomial in nature.


Let's fit a quadratic model for the mtcars dataset. For a linear model see Linear regression on the mtcars dataset.


First we make a scatter plot of the variables mpg (Miles/gallon), disp (Displacement (cu.in.)), and wt (Weight (1000
lbs)). The relationship between mpg and disp appears non-linear.


plot(mtcars[,c("mpg", "disp", "wt")]) 
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2 3 ‘ 
A linear fit will show that disp is not significant. 
fit@ = lm(mpg ~ wt+disp, mtcars) 
summary (fit@) 
# Coefficients: 
# Estimate Std. Error t value Pr(>|t|) 
#(Intercept) 34.96055 2.16454 16.151 4.91e-16 **x 
#wt -3 .35082 ie 16413" -=2878> 000743) 4% 
#disp =O. 01773 6500919 -1.929 @206362 - 


#Signif. codes: @ ‘***’ @.001 ‘**’ @.01 ‘*’ @.05 ‘.’ 0.1 ‘’ 1 
#Residual standard error: 2.917 on 29 degrees of freedom 
#Multiple R-squared: @.7809, Adjusted R-squared: 0.7658 




Then, to get the result of a quadratic model, we add I(disp^2). The new model appears better when looking at
R^2, and all variables are significant.


fit1 = lm(mpg ~ wt+disp+I(disp^2), mtcars)
summary(fit1)

# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
#(Intercept) 41.4019837  2.4266906  17.061  2.5e-16 ***
#wt          -3.4179165  0.9545642  -3.581 0.001278 **
#disp        -0.0823950  0.0182460  -4.516 0.000104 ***
#I(disp^2)    0.0001277  0.0000328   3.892 0.000561 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 2.391 on 28 degrees of freedom
#Multiple R-squared: 0.8578, Adjusted R-squared: 0.8426


As we have three variables, the fitted model is a surface represented by: 


mpg = 41.4020 - 3.4179*wt - 0.0824*disp + 0.0001277*disp^2


Another way to specify polynomial regression is using poly with the parameter raw=TRUE; otherwise orthogonal
polynomials will be used (see help(poly) for more information). We get the same result using:


summary(lm(mpg ~ wt+poly(disp, 2, raw=TRUE), mtcars))
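

The comparison described above can also be made numerically. The following is a minimal sketch, assuming only base R and the built-in mtcars data; the object names fit_lin, fit_quad and fit_poly are chosen here for illustration. It checks that the two quadratic specifications give the same fitted values and uses anova() as one possible way to compare the nested linear and quadratic models.

```r
# Linear and quadratic fits on the built-in mtcars data
fit_lin  <- lm(mpg ~ wt + disp, data = mtcars)
fit_quad <- lm(mpg ~ wt + disp + I(disp^2), data = mtcars)
fit_poly <- lm(mpg ~ wt + poly(disp, 2, raw = TRUE), data = mtcars)

# Both quadratic specifications describe the same model,
# so their fitted values agree up to numerical tolerance
same_fit <- isTRUE(all.equal(fitted(fit_quad), fitted(fit_poly)))
print(same_fit)

# F-test for the nested models: does adding disp^2 improve the fit?
cmp <- anova(fit_lin, fit_quad)
print(cmp)
```

A small p-value in the second row of the anova table suggests that the quadratic term improves the fit, which is in line with the summary output shown earlier.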


Finally, what if we need to show a plot of the estimated surface? There are many options for making 3D plots in R.
Here we use Fit3d from the p3d package.


library(p3d) 
Init3d(family="serif", cex = 1) 
Plot3d(mpg ~ disp+wt, mtcars) 
Axes3d() 

Fit3d(fit1) 




Section 22.5: Plotting The Regression (base) 


Continuing on the mtcars example, here is a simple way to produce a plot of your linear regression that is 
potentially suitable for publication. 


First fit the linear model:


fit <- lm(mpg ~ wt, data = mtcars)


Then plot the two variables of interest and add the regression line over the observed range of wt:


plot(mtcars$wt, mtcars$mpg, pch=18, xlab = 'wt', ylab = 'mpg')
lines(c(min(mtcars$wt), max(mtcars$wt)),
      as.numeric(predict(fit, data.frame(wt=c(min(mtcars$wt), max(mtcars$wt))))))


Almost there! The last step is to add the regression equation, the R-squared value and the correlation
coefficient to the plot. This is done using the vector function:


rp = vector('expression', 3)
rp[1] = substitute(expression(italic(y) == MYOTHERVALUE3 + MYOTHERVALUE4 %*% x),
          list(MYOTHERVALUE3 = format(fit$coefficients[1], digits = 2),
               MYOTHERVALUE4 = format(fit$coefficients[2], digits = 2)))[2]
rp[2] = substitute(expression(italic(R)^2 == MYVALUE),
          list(MYVALUE = format(summary(fit)$adj.r.squared, digits = 3)))[2]
rp[3] = substitute(expression(Pearson-R == MYOTHERVALUE2),
          list(MYOTHERVALUE2 = format(cor(mtcars$wt, mtcars$mpg), digits = 2)))[2]


legend("topright", legend = rp, bty = 'n') 


Note that you can add any other parameter, such as the RMSE, by adapting the vector function. Imagine you want a
legend with 10 elements. The vector definition would be the following:


rp = vector('expression',10)


and you will need to define rp[1] to rp[10].
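

As an illustration of adapting the vector, here is a hedged sketch that adds the RMSE as a fourth legend entry. It uses bquote() instead of the substitute() idiom above, which is a swapped-in alternative rather than the original method; the name rmse and the slot order are assumptions made for this example.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Root mean squared error of the fit (an extra statistic for the legend)
rmse <- sqrt(mean(residuals(fit)^2))

# Reserve four slots instead of three and fill each with a plotmath call
rp <- vector("expression", 4)
rp[[1]] <- bquote(italic(y) == .(format(coef(fit)[[1]], digits = 2)) +
                    .(format(coef(fit)[[2]], digits = 2)) %*% x)
rp[[2]] <- bquote(italic(R)^2 == .(format(summary(fit)$adj.r.squared, digits = 3)))
rp[[3]] <- bquote(Pearson-R == .(format(cor(mtcars$wt, mtcars$mpg), digits = 2)))
rp[[4]] <- bquote(RMSE == .(format(rmse, digits = 3)))

# With a plot already drawn, the labels would be added as before:
# legend("topright", legend = rp, bty = "n")
```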


Here is the output:


(Figure: scatter plot of mpg against wt with the fitted regression line; the legend reads y = 37 + -5.3 × x,
R^2 = 0.745, Pearson-R = -0.87.)


Section 22.6: Quality assessment


After building a regression model it is important to check the result and decide if the model is appropriate and 
works well with the data at hand. This can be done by examining the residuals plot as well as other diagnostic plots. 


# fit the model
fit <- lm(mpg ~ wt, data = mtcars)

# stack two plots vertically
par(mfrow=c(2,1))

# plot model object
plot(fit, which = 1:2)


(Figure: residuals vs. fitted values, and a normal Q-Q plot of the residuals against theoretical quantiles.)


These plots check for two assumptions that were made while building the model: 


1. That the expected value of the predicted variable (in this case mpg) is given by a linear combination of the 
predictors (in this case wt). We expect this estimate to be unbiased. So the residuals should be centered 
around the mean for all values of the predictors. In this case we see that the residuals tend to be positive at 
the ends and negative in the middle, suggesting a non-linear relationship between the variables. 

2. That the actual predicted variable is normally distributed around its estimate. Thus, the residuals should be
normally distributed. For normally distributed data, the points in a normal Q-Q plot should lie on or close to
the diagonal. There is some amount of skew at the ends here.


Chapter 23: data.table


Data.table is a package that extends the functionality of data frames from base R, particularly improving on their 
performance and syntax. See the package's Docs area at Getting started with data.table for details. 


Section 23.1: Creating a data.table 


A data.table is an enhanced version of the data.frame class from base R. As such, its class() attribute is the vector
"data.table" "data.frame" and functions that work on a data.frame will also work with a data.table. There are
many ways to create, load or coerce to a data.table.


Build


Don't forget to install and activate the data.table package:


library(data.table)


There is a constructor of the same name:


DT <- data.table(
  x = letters[1:5],
  y = 1:5,
  z = (1:5) > 3
)

#    x y     z
# 1: a 1 FALSE
# 2: b 2 FALSE
# 3: c 3 FALSE
# 4: d 4  TRUE
# 5: e 5  TRUE


Unlike data.frame (in R versions before 4.0.0), data.table will not coerce strings to factors:


sapply(DT, class)
#           x           y           z
# "character"   "integer"   "logical"


Read in 

We can read from a text file: 

dt <- fread("my_file.csv") 

Unlike read.csv, fread will read strings as strings, not as factors.


Modify a data.frame


For efficiency, data.table offers a way of altering a data.frame or list to make a data.table in-place (without making a 
copy or changing its memory location): 


# example data.frame
DF <- data.frame(x = letters[1:5], y = 1:5, z = (1:5) > 3)


# modification
setDT(DF)


Note that we do not <- assign the result, since the object DF has been modified in-place. The class attributes of the
data.frame columns will be retained (x is a factor here because data.frame converted strings to factors by default
before R 4.0.0):


sapply(DF, class)
#        x         y         z
# "factor" "integer" "logical"


Coerce object to data.table 


If you have a list, data.frame, or data.table, you should use the setDT function to convert to a data.table
because it does the conversion by reference instead of making a copy (which as.data.table does). This is
important if you are working with large datasets.


If you have another R object (such as a matrix), you must use as.data.table to coerce it to a data.table.


mat <- matrix(0, ncol = 10, nrow = 10)
DT <- as.data.table(mat) 


# or 
DT <- data.table(mat) 
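

The difference between the two conversion routes can be checked directly. This is a small sketch, assuming the data.table package is installed; the object names DF, DT_copy, class_before and class_after are illustrative only.

```r
library(data.table)

DF <- data.frame(x = letters[1:5], y = 1:5)

# as.data.table() returns a new object; DF itself is left untouched
DT_copy <- as.data.table(DF)
class_before <- class(DF)

# setDT() converts DF itself by reference, so no assignment is needed
setDT(DF)
class_after <- class(DF)

print(class_before)
print(class_after)
```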


Section 23.2: Special symbols in data.table 

.SD


.SD refers to the subset of the data.table for each group, excluding all columns used in by.

.SD along with lapply can be used to apply any function to multiple columns by group in a data.table.

We will continue using the same built-in dataset, mtcars:

mtcars = data.table(mtcars) # Let's not include rownames to keep things simpler

Mean of all columns in the dataset by number of cylinders, cyl:

mtcars[ , lapply(.SD, mean), by = cyl] 

#    cyl      mpg     disp        hp     drat       wt     qsec        vs        am     gear     carb
# 1:   6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
# 2:   4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
# 3:   8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000


Apart from cyl, there are other categorical columns in the dataset such as vs, am, gear and carb. It doesn't really 
make sense to take the mean of these columns. So let's exclude these columns. This is where .SDcols comes into 
the picture. 


.SDcols


.SDcols specifies the columns of the data.table that are included in .SD.


Mean of all columns (continuous columns) in the dataset by number of gears, gear, and number of cylinders, cyl,
arranged by gear and cyl:


# All the continuous variables in the dataset
cols_chosen <- c("mpg", "disp", "hp", "drat", "wt", "qsec")


mtcars[order(gear, cyl), lapply(.SD, mean), by = .(gear, cyl), .SDcols = cols_chosen]


#    gear cyl    mpg     disp       hp     drat       wt    qsec
# 1:    3   4 21.500 120.1000  97.0000 3.700000 2.465000 20.0100
# 2:    3   6 19.750 241.5000 107.5000 2.920000 3.337500 19.8300
# 3:    3   8 15.050 357.6167 194.1667 3.120833 4.104083 17.1425
# 4:    4   4 26.925 102.6250  76.0000 4.110000 2.378125 19.6125
# 5:    4   6 19.750 163.8000 116.5000 3.910000 3.093750 17.6700
# 6:    5   4 28.200 107.7000 102.0000 4.100000 1.826500 16.8000
# 7:    5   6 19.700 145.0000 175.0000 3.620000 2.770000 15.5000
# 8:    5   8 15.400 326.0000 299.5000 3.880000 3.370000 14.5500


Maybe we don't want to calculate the mean by groups. To calculate the mean for all the cars in the dataset, we don't
specify the by variable.


mtcars[ , lapply(.SD, mean), .SDcols = cols_chosen]


#         mpg     disp       hp     drat      wt     qsec
# 1: 20.09062 230.7219 146.6875 3.596563 3.21725 17.84875


Note: 


• It is not necessary to define cols_chosen beforehand. .SDcols can directly take column names.


• .SDcols can also directly take a vector of column numbers. In the above example this would be
mtcars[ , lapply(.SD, mean), .SDcols = c(1,3:7)]


.N


.N is shorthand for the number of rows in a group.


as.data.table(iris)[, .(count=.N), by=Species]
#       Species count
# 1:     setosa    50
# 2: versicolor    50
# 3:  virginica    50
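

The special symbols can be combined in a single call. The sketch below, assuming the data.table package is installed, computes the group size with .N and the group means with .SD restricted by .SDcols; the names cars_dt and summary_by_cyl are illustrative.

```r
library(data.table)

cars_dt <- as.data.table(mtcars)

# One row per cylinder count: number of cars (.N) and the mean of the
# columns selected through .SDcols
summary_by_cyl <- cars_dt[, c(list(n = .N), lapply(.SD, mean)),
                          by = cyl, .SDcols = c("mpg", "wt")]
print(summary_by_cyl)
```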


Section 23.3: Adding and modifying columns 


DT[where, select|update|do, by] syntax is used to work with columns of a data.table. 


• The "where" part is the i argument
• The "select|update|do" part is the j argument


These two arguments are usually passed by position instead of by name. 


Our example data below is 


mtcars = data.table(mtcars, keep.rownames = TRUE) 


Editing entire columns


Use the := operator inside j to assign new columns:


mtcars[, mpg_sq := mpg^2]


Remove columns by setting to NULL:


mtcars[, mpg_sq := NULL]


Add multiple columns by using the := operator's multivariate format: 


mtcars[, `:=`(mpg_sq = mpg^2, wt_sqrt = sqrt(wt))]
# or
mtcars[, c("mpg_sq", "wt_sqrt") := .(mpg^2, sqrt(wt))]


If the columns are dependent and must be defined in sequence, one way is:


mtcars[, c("mpg_sq", "mpg2_hp") := .(temp1 <- mpg^2, temp1/hp)]


The .() syntax is used when the right-hand side of LHS := RHS is a list of columns.

For dynamically-determined column names, use parentheses:


vn = "mpg_sq" 
mtcars[, (vn) := mpg^2]


Columns can also be modified with set, though this is rarely necessary: 


set(mtcars, j = "hp_over_wt", v = mtcars$hp/mtcars$wt)


Editing subsets of columns 


Use the i argument to subset to rows "where" edits should be made: 


mtcars[1:3, newvar := "Hello" ] 
# or 
set(mtcars, j = "newvar", i = 1:3, v = "Hello") 


As in a data.frame, we can subset using row numbers or logical tests. It is also possible to use a "join" in i, but that 
more complicated task is covered in another example. 
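

As a sketch of editing a subset selected by a logical test, the following assumes the data.table package is installed; the column name economy and its labels are hypothetical and only serve the example.

```r
library(data.table)

cars_dt <- as.data.table(mtcars, keep.rownames = "model")

# Only rows matching the logical test in i are assigned; the rest become NA
cars_dt[mpg > 25, economy := "high"]

# A second pass fills the remaining rows
cars_dt[is.na(economy), economy := "other"]

n_high <- cars_dt[economy == "high", .N]
print(n_high)
```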


Editing column attributes 


Functions that edit attributes, such as levels<- or names<-, actually replace an object with a modified copy. Even if 
only used on one column in a data.table, the entire object is copied and replaced. 


To modify an object without copies, use setnames to change the column names of a data.table or data.frame and 
setattr to change an attribute for any object. 


# Print a message to the console whenever the data.table is copied
tracemem(mtcars)
mtcars[, cyl2 := factor(cyl)]


# Neither of these statements copy the data.table
setnames(mtcars, old = "cyl2", new = "cyl_fac")
setattr(mtcars$cyl_fac, "levels", c("four", "six", "eight"))


# Each of these statements copies the data.table
names(mtcars)[names(mtcars) == "cyl_fac"] <- "cf"
levels(mtcars$cf) <- c("IV", "VI", "VIII")


Be aware that changes made with setnames and setattr are made by reference, so they are global. Changing them within
one environment affects the object in all environments.


# This function also changes the levels in the global environment
edit_levels <- function(x) setattr(x, "levels", c("low", "med", "high"))
edit_levels(mtcars$cf)
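

When a by-reference change should not leak into other environments, one option is to work on an explicit copy. The following sketch assumes the data.table package is installed and uses its copy() function; the object names are illustrative.

```r
library(data.table)

DT <- data.table(a = 1:3)

alias <- DT          # a second name for the same object
indep <- copy(DT)    # an independent deep copy

# Adding a column through the alias also changes DT, but not the copy
alias[, b := a * 2L]

print(names(DT))
print(names(indep))
```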


Section 23.4: Writing code compatible with both data.frame 
and data.table 


Differences in subsetting syntax 


A data.table is one of several two-dimensional data structures available in R, besides data.frame, matrix and (2D)
array. All of these classes use a very similar but not identical syntax for subsetting, the A[rows, cols] schema.


Consider the following data stored in a matrix, a data.frame and a data.table:


ma <- matrix(rnorm(12), nrow=4, dimnames=list(letters[1:4], c('X', 'Y', 'Z')))
df <- as.data.frame(ma)
dt <- as.data.table(ma)

ma[2:3] #---> returns the 2nd and 3rd items, as if 'ma' were a vector (because it is!)
df[2:3] #---> returns the 2nd and 3rd columns
dt[2:3] #---> returns the 2nd and 3rd rows!


If you want to be sure of what will be returned, it is better to be explicit. 


To get specific rows, just add a comma after the range: 


ma[2:3, ] # \
df[2:3, ] #  }---> returns the 2nd and 3rd rows
dt[2:3, ] # /


But, if you want to subset columns, some cases are interpreted differently. All three can be subset the same way 
with integer or character indices not stored in a variable. 


ma[, 2:3]          #  \
df[, 2:3]          #   \
dt[, 2:3]          #    }---> returns the 2nd and 3rd columns
ma[, c("Y", "Z")]  #   /
df[, c("Y", "Z")]  #  /
dt[, c("Y", "Z")]  # /


However, they differ for unquoted variable names 


mycols <- 2:3 

ma[, mycols] # \ 

df[, mycols] # }---> returns the 2nd and 3rd columns 
dt[, mycols, with = FALSE] # / 


dt[, mycols] # ---> Raises an error 


In the last case, mycols is evaluated as the name of a column. Because dt cannot find a column named mycols, an 
error is raised. 
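

For column names held in a variable, a few equivalent data.table spellings exist. This is a minimal sketch, assuming a reasonably recent data.table release (the `..` prefix is, to the best of my knowledge, available from version 1.10.2 onwards); the table and variable names are illustrative.

```r
library(data.table)

dt <- data.table(X = 1:3, Y = 4:6, Z = 7:9)
mycols <- c("Y", "Z")

# Three ways to tell data.table that mycols holds column names
by_with   <- dt[, mycols, with = FALSE]
by_prefix <- dt[, ..mycols]
by_sd     <- dt[, .SD, .SDcols = mycols]

print(by_prefix)
```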


Note: For versions of the data.table package prior to 1.9.8, this behavior was slightly different. Anything in the
column index would have been evaluated using dt as an environment. So both dt[, 2:3] and dt[, mycols] would
return the vector 2:3. No error would be raised for the second case, because the variable mycols does exist in the
parent environment.


Strategies for maintaining compatibility with data.frame and data.table 


There are many reasons to write code that is guaranteed to work with both data.frame and data.table. Maybe you are
forced to use data.frame, or you may need to share code without knowing how it will be used. There are several
main strategies for achieving this, in order of convenience:


1. Use syntax that behaves the same for both classes.
2. Use a common function that does the same thing as the shortest syntax.
3. Force data.table to behave as data.frame (ex.: call the specific method print.data.frame).
4. Treat them as list, which they ultimately are.
5. Convert the table to a data.frame before doing anything (bad idea if it is a huge table).
6. Convert the table to data.table, if dependencies are not a concern.


Subset rows. It's simple, just use the [, ] selector, with the comma:


A[1:10, ]
A[A$var > 17, ] # A[var > 17, ] just works for data.table


Subset columns. If you want a single column, use the $ or the [[ ]] selector:


A$var
colname <- 'var'
A[[colname]]
A[[1]]
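

The row and single-column idioms above can be combined into a function that behaves the same for both classes. This is only a sketch, assuming the data.table package is installed; the helper values_above and its arguments are hypothetical names introduced for illustration.

```r
library(data.table)

# Hypothetical helper: values of one column for the rows where that
# column exceeds a threshold
values_above <- function(A, colname, threshold) {
  keep <- A[[colname]] > threshold   # [[ ]] works the same on both classes
  A[keep, ][[colname]]               # explicit comma: row subset on both classes
}

df <- data.frame(id = 1:5, var = c(10, 20, 30, 40, 50))
dt <- as.data.table(df)

print(values_above(df, "var", 25))
print(values_above(dt, "var", 25))
```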


If you want a uniform way to grab more than one column, a small workaround is needed:


B <- `[.data.frame`(A, 2:4)


# We can give it a better name
select <- `[.data.frame`

B <- select(A, 2:4)
C <- select(A, c('foo', 'bar'))


Subset 'indexed' rows. While data.frame has row.names, data.table has its unique key feature. The best thing is
to avoid row.names entirely and take advantage of the existing optimizations in the case of data.table when
possible.


B <- A[A$var != 0, ]
# or...
B <- with(A, A[var != 0, ]) # data.table will silently index A by var before subsetting


stuff <- c("a", "c", "f")
C <- A[match(stuff, A$name), ] # really worse than: setkey(A); A[stuff, ]


Get a 1-column table, get a row as a vector. These are easy with what we have seen until now:


B <- select(A, 2) #---> a table with just the second column
C <- unlist(A[1, ]) #---> the first row as a vector (coerced if necessary)


Section 23.5: Setting keys in data.table


Yes, you need to SETKEY pre 1.9.6


In the past (pre 1.9.6), your data.table was sped up by setting columns as keys to the table, particularly for large
tables. [See intro vignette page 5 of September 2015 version, where speed of search was 544 times better.] You
may find older code that sets keys with setkey or passes a key= column when setting up the table.


library(data.table)
DT <- data.table(
  x = letters[1:5],
  y = 5:1,
  z = (1:5) > 3
)

#> DT
#    x y     z
# 1: a 5 FALSE
# 2: b 4 FALSE
# 3: c 3 FALSE
# 4: d 2  TRUE
# 5: e 1  TRUE


Set your key with the setkey command. You can have a key with multiple columns.


setkey(DT, y)


Check your table's key in tables()


> tables()
     NAME NROW NCOL MB COLS  KEY
[1,] DT      5    3  1 x,y,z y
Total: 1MB


Note this will re-sort your data.


#> DT
#    x y     z
# 1: e 1  TRUE
# 2: d 2  TRUE
# 3: c 3 FALSE
# 4: b 4 FALSE
# 5: a 5 FALSE


Now it is unnecessary 


Prior to v1.9.6 you had to set a key for certain operations, especially joining tables. The developers of
data.table have since sped up these operations and introduced an on= argument that can replace the dependency on
keys. See SO answer here for a detailed discussion.


In Jan 2017, the developers wrote a vignette on secondary indices, which explains the on syntax and
allows other columns to be identified for fast indexing.


Creating secondary indices


In a manner similar to key, you can setindex(DT, key.col) or setindexv(DT, "key.col.string"), where DT is
your data.table. Remove all indices with setindex(DT, NULL).

See your secondary indices with indices(DT).


Why secondary indices? 


This does not sort the table (unlike key), but does allow for quick indexing using the "on" syntax. Note there can be
only one key, but you can use multiple secondary indices, which saves having to rekey and resort the table. This will
speed up your subsetting when changing the columns you want to subset on.


Recall, in example above y was the key for table DT: 


DT
#    x y     z
# 1: e 1  TRUE
# 2: d 2  TRUE
# 3: c 3 FALSE
# 4: b 4 FALSE
# 5: a 5 FALSE


# Let us set x as index
setindex(DT, x)


# Use indices to see what has been set
indices(DT)
# [1] "x"


# fast subset using index and not keyed column
DT["c", on = "x"]
#    x y     z
# 1: c 3 FALSE


# old way would have been rekeying DT from y to x, doing subset and 
# perhaps keying back to y (now we save two sorts) 


# This is a toy example above but would have been more valuable with big data sets 


Goalkicker.com - R Notes for Professionals 




Chapter 24: Pivot and unpivot with 
data.table 


Parameter      Details
id.vars        tell melt which columns to retain
variable.name  tell melt what to call the column with category labels
value.name     tell melt what to call the column that has values associated with category labels
value.var      tell dcast where to find the values to cast in columns
formula        tell dcast which columns to retain to form a unique record identifier (LHS) and which one holds the
               category labels (RHS)
fun.aggregate  specify the function to use when the casting operation generates a list of values in each cell


Section 24.1: Pivot and unpivot tabular data with data.table - I
Convert from wide form to long form 


Load data USArrests from datasets. 


data("USArrests") 
head(USArrests) 


           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7


Use ?USArrests to find out more. First, convert to data.table. The names of states are row names in the original
data.frame.


library(data.table)
DT <- as.data.table(USArrests, keep.rownames=TRUE)


This is data in the wide form. It has a column for each variable. The data can also be stored in long form without 
loss of information. The long form has one column that stores the variable names. Then, it has another column for 
the variable values. The long form of USArrests looks like so. 


              State  Crime Rate
  1:        Alabama Murder 13.2
  2:         Alaska Murder 10.0
  3:        Arizona Murder  8.1
  4:       Arkansas Murder  8.8
  5:     California Murder  9.0
 ---
196:       Virginia   Rape 20.7
197:     Washington   Rape 26.2
198:  West Virginia   Rape  9.3
199:      Wisconsin   Rape 10.8
200:        Wyoming   Rape 15.6


We use the melt function to switch from wide form to long form.


DTm <- melt(DT)
names(DTm) <- c("State", "Crime", "Rate")


By default, melt treats all columns with numeric data as variables with values. In USArrests, the variable UrbanPop
represents the percentage urban population of a state. It is different from the other variables, Murder, Assault and
Rape, which are violent crimes reported per 100,000 people. Suppose we want to retain the UrbanPop column. We
achieve this by setting id.vars as follows.


DTmu <- melt(DT, id.vars=c("rn", "UrbanPop"),
             variable.name='Crime', value.name = "Rate")
names(DTmu)[1] <- "State"


Note that we have specified the names of the column containing category names (Murder, Assault, etc.) with
variable.name and the column containing the values with value.name. Our data looks like so.


        State UrbanPop  Crime Rate
1:    Alabama       58 Murder 13.2
2:     Alaska       48 Murder 10.0
3:    Arizona       80 Murder  8.1
4:   Arkansas       50 Murder  8.8
5: California       91 Murder  9.0


Generating summaries with a split-apply-combine style approach is a breeze. For example, to summarize violent
crimes by state:


DTmu[, .(ViolentCrime = sum(Rate)), by=State]


This gives:


        State ViolentCrime
1:    Alabama        270.4
2:     Alaska        317.5
3:    Arizona        333.1
4:   Arkansas        218.3
5: California        325.6
6:   Colorado        250.6


Section 24.2: Pivot and unpivot tabular data with data.table - II
Convert from long form to wide form


To recover data from the previous example, use dcast like so.


DTc <- dcast(DTmu, State + UrbanPop ~ Crime)


This gives the data in the original wide form.


        State UrbanPop Murder Assault Rape
1:    Alabama       58   13.2     236 21.2
2:     Alaska       48   10.0     263 44.5
3:    Arizona       80    8.1     294 31.0
4:   Arkansas       50    8.8     190 19.5
5: California       91    9.0     276 40.6


Here, the formula notation is used to specify the columns that form a unique record identifier (LHS) and the column
containing category labels for new column names (RHS). Which column is used for the numeric values? By default,
dcast uses the first column with numerical values left over from the formula specification. To make this explicit,
use the parameter value.var with the column name.
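

The round trip from wide to long and back can be verified in a few lines. This sketch assumes the data.table package is installed and passes value.var explicitly; the names long and wide are illustrative, and keep.rownames = "State" is used here to name the row-name column directly instead of renaming "rn" afterwards.

```r
library(data.table)

DT <- as.data.table(USArrests, keep.rownames = "State")

# Wide to long: one row per state and crime type
# (melt may warn that the measured columns have mixed integer/double types)
long <- melt(DT, id.vars = c("State", "UrbanPop"),
             variable.name = "Crime", value.name = "Rate")

# Long back to wide, naming the value column explicitly
wide <- dcast(long, State + UrbanPop ~ Crime, value.var = "Rate")

print(dim(long))
print(head(wide, 3))
```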


When the operation produces a list of values in each cell, dcast provides a fun.aggregate method to handle the
situation. Say I am interested in states with similar urban population when investigating crime rates. I add a column
Decile with computed information.


DTmu[, Decile := cut(UrbanPop, quantile(UrbanPop, probs = seq(0, 1, by=0.1)))]
levels(DTmu$Decile) <- paste0(1:10, "D")


Now, casting Decile ~ Crime produces multiple values per cell. I can use fun.aggregate to determine how these
are handled, for example by summing or averaging them. Both text and numerical values can be handled this way.


dcast(DTmu, Decile ~ Crime, value.var="Rate", fun.aggregate=sum)
dcast(DTmu, Decile ~ Crime, value.var="Rate", fun.aggregate=mean)


The data with the new Decile column looks like so:


        State UrbanPop  Crime Rate Decile
1:    Alabama       58 Murder 13.2     4D
2:     Alaska       48 Murder 10.0     2D
3:    Arizona       80 Murder  8.1     8D
4:   Arkansas       50 Murder  8.8     2D
5: California       91 Murder  9.0    10D

There are multiple states in each decile of the urban population. Use fun.aggregate to specify how these should be 
handled. 


dcast(DTmu, Decile ~ Crime, value.var="Rate", fun.aggregate=sum) 
This sums over the data for like states, giving the following. 


   Decile Murder Assault  Rape
1:     1D   39.4     808  62.6
2:     2D   35.3     815  94.3
3:     3D   22.6     451  67.7
4:     4D   54.9     898 106.0
5:     5D   42.4     758 107.6


Chapter 25: Bar Chart


The purpose of the bar plot is to display the frequencies (or proportions) of levels of a factor variable. For example,
a bar plot is used to pictorially display the frequencies (or proportions) of individuals in various socio-economic
(factor) groups (levels: high, middle, low). Such a plot will help to provide a visual comparison among the various
factor levels.


Section 25.1: barplot() function 


In a bar plot, factor levels are placed on the x-axis and the frequencies (or proportions) of the factor levels are
shown on the y-axis. For each factor level, one bar of uniform width is drawn, with its height proportional to the
frequency (or proportion) of that level.


The barplot() function is in the graphics package of R's system library. The barplot() function must be
supplied at least one argument. The R help calls this argument height; it must be either a vector or a matrix. If it is
a vector, its values give the heights of the bars, one per factor level.


To illustrate barplot(), consider the following data preparation:


> grades<-c("A+", "A-", "B+", "B","C")
> Marks<-sample(grades, 40, replace=T, prob=c(.2, .3, .25, .15, .1))
> Marks
# [output: a vector of 40 randomly sampled grades, which differs from run to run]


A bar chart of the Marks vector is obtained from 


> barplot(table(Marks) ,main="Mid-Marks in Algorithms" ) 


(Figure: bar chart titled "Mid-Marks in Algorithms" with bars labelled A-, A+, B, B+, C.)


Notice that the barplot() function places the factor levels on the x-axis in the lexicographical order of the levels.
The parameter names.arg sets the labels under the bars, but it only relabels them and does not reorder the counts.
To place the bars in the order stated in the vector grades, tabulate Marks as a factor with those levels:


# order the counts to match the desired horizontal axis labels
> barplot(table(factor(Marks, levels=grades)),names.arg=grades,main="Mid-Marks in Algorithms")
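

The effect of the level order on the tabulated counts can be seen without drawing anything. This is a small sketch on fixed, made-up data (the vector marks below is hypothetical, so that the result does not depend on random sampling).

```r
grades <- c("A+", "A-", "B+", "B", "C")

# Hypothetical marks, fixed so the example is reproducible
marks <- c("B", "A+", "C", "A-", "A+", "B+", "A-", "A-")

# Default tabulation: levels appear in sorted order
counts_default <- table(marks)

# Tabulating a factor with explicit levels keeps the order of 'grades'
counts_ordered <- table(factor(marks, levels = grades))

print(counts_default)
print(counts_ordered)
```

Passing counts_ordered to barplot() then draws the bars in the order of grades, with labels that match the counts.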




Colored bars can be drawn using the col= parameter.


> barplot(table(factor(Marks, levels=grades)),names.arg=grades,col = c("lightblue",
          "lightcyan", "lavender", "mistyrose", "cornsilk"),
          main="Mid-Marks in Algorithms")


A bar chart with horizontal bars can be obtained as follows:


> barplot(table(factor(Marks, levels=grades)),names.arg=grades,horiz=TRUE,col = c("lightblue",
          "lightcyan", "lavender", "mistyrose", "cornsilk"),
          main="Mid-Marks in Algorithms")


A bar chart with proportions on the y-axis can be obtained as follows:


> barplot(prop.table(table(factor(Marks, levels=grades))),names.arg=grades,col = c("lightblue",
          "lightcyan", "lavender", "mistyrose", "cornsilk"),
          main="Mid-Marks in Algorithms")


The sizes of the factor-level names on the x-axis can be increased using the cex.names parameter.


> barplot(prop.table(table(factor(Marks, levels=grades))),names.arg=grades,col = c("lightblue",
          "lightcyan", "lavender", "mistyrose", "cornsilk"),
          main="Mid-Marks in Algorithms",cex.names=2)


The height parameter of barplot() can also be a matrix. For example, the columns could be the various subjects
taken in a course and the rows the labels of the grades. Consider the following matrix:


> gradTab
   Algorithms Operating Systems Discrete Math
A-         15                10             7
A+         10                 7             2
B           4                 2            14
B+          8                19             1
C           3                 2             5
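

The text does not show how gradTab was built. One way such a matrix could be produced from raw records is a two-way table, sketched below on made-up data; the data frame records and its contents are hypothetical.

```r
# Hypothetical raw records: one row per student and subject
records <- data.frame(
  grade   = c("A+", "B", "A-", "A+", "C", "B+", "A-", "A-"),
  subject = c("Algorithms", "Algorithms", "Algorithms", "Algorithms",
              "Discrete Math", "Discrete Math", "Discrete Math", "Discrete Math")
)

# Two-way frequency table: grades in rows, subjects in columns
grad_tab <- table(records$grade, records$subject)
print(grad_tab)

# Each column total is the number of students recorded for that subject
col_totals <- colSums(grad_tab)
print(col_totals)
```

A table of this shape can be passed to barplot() like a matrix, for example barplot(grad_tab, beside = TRUE, legend.text = TRUE), where legend.text = TRUE takes the legend labels from the row names.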


To draw a stacked bar, simply use the command:


> barplot(gradTab,col = c("lightblue", "lightcyan",
          "lavender", "mistyrose", "cornsilk"),legend.text = rownames(gradTab),
          main="Mid-Marks in Algorithms")


(Figure: stacked bar chart with one bar each for Algorithms, Operating Systems and Discrete Math.)


To draw juxtaposed bars, use the beside parameter, as given under:


> barplot(gradTab, beside = T,col = c("lightblue", "lightcyan",
          "lavender", "mistyrose", "cornsilk"),legend.text = rownames(gradTab),
          main="Mid-Marks in Algorithms")


(Figure: grouped bar chart with one group of bars each for Algorithms, Operating Systems and Discrete Math.)

A horizontal bar chart can be obtained using the horiz=T parameter:


> barplot(gradTab, beside = T,horiz=T,col = c("lightblue", "lightcyan",
          "lavender", "mistyrose", "cornsilk"),legend.text = rownames(gradTab),
          cex.names=.75,main="Mid-Marks in Algorithms")


(Figure: horizontal grouped bar chart for Algorithms, Operating Systems and Discrete Math.)


Chapter 26: Base Plotting


Parameter  Details
x          x-axis variable. May supply either data$variablex or data[,x]
y          y-axis variable. May supply either data$variabley or data[,y]
main       Main title of plot
sub        Optional subtitle of plot
xlab       Label for x-axis
ylab       Label for y-axis
pch        Integer or character indicating plotting symbol
col        Integer or string indicating color
type       Type of plot. "p" for points, "l" for lines, "b" for both, "c" for the lines part alone of "b", "o" for both
           'overplotted', "h" for 'histogram'-like (or 'high-density') vertical lines, "s" for stair steps, "S" for other
           steps, "n" for no plotting


Section 26.1: Density plot 


A very useful and logical follow-up to histograms would be to plot the smoothed density function of a random
variable. A basic plot produced by the command


plot(density(rnorm(100)),main="Normal density",xlab="x")


would look like


(Figure: smoothed density curve titled "Normal density", with x from about -3 to 3 and Density from 0.0 to 0.4.)


You can overlay a histogram and a density curve with


x=rnorm(100)
hist(x, prob=TRUE, main="Normal density + histogram")
lines(density(x), lty="dotted", col="red")


which gives


(Figure: histogram titled "Normal density + histogram" with a dotted red density curve overlaid.)
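

The object returned by density() can also be inspected without plotting. The sketch below, using only base R and a fixed seed chosen for reproducibility, approximates the area under the estimated curve with the trapezoidal rule; the variable names are illustrative.

```r
set.seed(42)
x <- rnorm(100)

# Kernel density estimate, evaluated on a grid of 512 points by default
d <- density(x)

# Trapezoidal rule over the evaluation grid
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
print(area)
```

The area should be close to 1, which is why the histogram in the overlay above is drawn with prob=TRUE: both are then on the density scale.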


Section 26.2: Combining Plots 


It's often useful to combine multiple plot types in one graph (for example a Barplot next to a Scatterplot.) R makes 
this easy with the help of the functions par() and layout(). 


par() 


par uses the arguments mfrow or mfcol to create a matrix of nrows and ncols, c(nrows, ncols), which will serve as
a grid for your plots. The following example shows how to combine four plots in one graph:


par(mfrow=c(2,2))
plot(cars, main="Speed vs. Distance")
hist(cars$speed, main="Histogram of Speed")
boxplot(cars$dist, main="Boxplot of Distance")
boxplot(cars$speed, main="Boxplot of Speed")


(Figure: 2 x 2 grid with the panels "Speed vs. Distance", "Histogram of Speed", "Boxplot of Distance" and
"Boxplot of Speed".)


layout()


The layout() function is more flexible and allows you to specify the location and the extent of each plot within the
final combined graph. This function expects a matrix object as an input:


layout(matrix(c(1,1,2,3), 2,2, byrow=T))
hist(cars$speed, main="Histogram of Speed")
boxplot(cars$dist, main="Boxplot of Distance")
boxplot(cars$speed, main="Boxplot of Speed")


(Figure: "Histogram of Speed" across the top row, with the two boxplots side by side below it.)


Section 26.3: Getting Started with R_Plots


e Scatterplot 


You have two vectors and you want to plot them. 


x_values <- rnorm(n = 20, mean = 5, sd = 8)           # 20 values generated from Normal(mean 5, sd 8)
y_values <- rbeta(n = 20, shape1 = 500, shape2 = 10)  # 20 values generated from Beta(500, 10)


If you want to make a plot which has the y_values on the vertical axis and the x_values on the horizontal axis, you 
can use the following commands: 


plot(x = x_values, y = y_values, type = "p") # standard scatter plot
plot(x = x_values, y = y_values, type = "l") # plot with lines
plot(x = x_values, y = y_values, type = "n") # empty plot


You can type ?plot in the console to read about more options.

• Boxplot
You have some variables and you want to examine their distributions.


# A boxplot is an easy way to see whether the data contain outliers.

z <- rbeta(20, 500, 10)        # generate values from a beta distribution
z[c(19, 20)] <- c(0.97, 1.05)  # replace the last two values with outliers
boxplot(z)                     # the two isolated points are the outliers of variable z


• Histograms
An easy way to draw histograms:


hist(x = x_values)             # histogram of the x vector
hist(x = x_values, breaks = 3) # breaks suggests the number of bins; R treats it as a hint, not an exact count


• Pie charts
If you want to visualize the frequencies of a variable, draw a pie chart.


First we have to generate data with frequencies, for example:


P <- c(rep('A', 3), rep('B', 10), rep('C', 7))
t <- table(P) # frequency table of variable P
pie(t)        # visual version of the table above


Section 26.4: Basic Plot 


A basic plot is created by calling plot(). Here we use the built-in cars data frame that contains the speed of cars 
and the distances taken to stop in the 1920s. (To find out more about the dataset, use help(cars)). 


plot(x = cars$speed, y = cars$dist, pch = 1, col = 1,
     main = "Distance vs Speed of Cars",
     xlab = "Speed", ylab = "Distance")


[Figure: scatter plot titled "Distance to stop vs Speed of Cars"; x axis "Speed" from 5 to 25, y axis "Distance" from 20 to 120]


We can use many other variations in the code to get the same result. We can also change the parameters to obtain 
different results. 


with(cars, plot(dist~speed, pch = 2, col = 3, 
main = "Distance to stop vs Speed of Cars", 
xlab = "Speed", ylab = "Distance")) 

[Figure: scatter plot titled "Distance to stop vs Speed of Cars"]


Additional features can be added to this plot by calling points(), text(), mtext(), lines(), grid(), etc. 


plot(dist~speed, pch = "*", col = "magenta", data=cars,
     main = "Distance to stop vs Speed of Cars",
     xlab = "Speed", ylab = "Distance")
mtext("In the 1920s.")
grid(col="lightblue")

[Figure: scatter plot titled "Distance to stop vs Speed of Cars" with the margin text "In the 1920s."]


Section 26.5: Histograms

Histograms give an approximate picture of the underlying distribution of the data. 


hist(ldeaths)

hist(ldeaths, breaks = 20, freq = FALSE, col = 3)
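
hist() also returns the binning it computed, which can be inspected without drawing anything. The following sketch assumes only the built-in ldeaths series used above; the variable name h is an arbitrary choice.

```r
h <- hist(ldeaths, breaks = 20, plot = FALSE)  # compute the bins without drawing

h$breaks        # bin boundaries actually used (breaks = 20 is only a suggestion)
h$counts        # number of observations per bin
sum(h$counts)   # equals length(ldeaths)

# The stored result can be drawn later
plot(h, freq = FALSE, col = 3, main = "Histogram of ldeaths")
```

This is useful when the bin counts are needed for further calculations, or when the same binning should be reused across several plots.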


Section 26.6: Matplot

matplot is useful for quickly plotting multiple sets of observations from the same object, particularly from a matrix, 
on the same graph. 


Here is an example of a matrix containing four sets of random draws, each with a different mean. 


xmat <- cbind(rnorm(100, -3), rnorm(100, -1), rnorm(100, 1), rnorm(100, 3)) 


head(xmat)

[Output: the first six rows of xmat, with one column ([,1] to [,4]) per set of draws; the values are random and differ from run to run]


One way to plot all of these observations on the same graph is to do one plot call followed by three more points 
or lines calls. 


plot(xmat[,1], type = 'l')
lines(xmat[,2], col = 'red')
lines(xmat[,3], col = 'green')
lines(xmat[,4], col = 'blue')


[Figure: line plot with y axis "xmat[, 1]" and x axis "Index" from 0 to 100]


However, this is tedious, and it causes problems because, among other things, by default the axis limits are 
fixed by plot to fit only the first column. 


Much more convenient in this situation is to use the matplot function, which only requires one call and 
automatically takes care of axis limits and changing the aesthetics for each column to make them distinguishable. 


matplot(xmat, type = 'l') 


[Figure: the columns of xmat drawn as lines; y axis "xmat", x axis from 0 to 100]


Note that, by default, matplot varies both color (col) and linetype (lty) because this increases the number of 
possible combinations before they get repeated. However, either (or both) of these aesthetics can be fixed to a single 
value... 


matplot(xmat, type = 'l', col = 'black') 


...or a custom vector (which will recycle to the number of columns, following standard R vector recycling rules). 


matplot(xmat, type = 'l', col = c('red', 'green', 'blue', 'orange')) 

[Figure: the columns of xmat drawn as lines in the custom colours; y axis "xmat", x axis from 0 to 100]


Standard graphical parameters and plot arguments, including main, xlab and xlim, work exactly the same way as for 
plot. For more on those, see ?par. 


Like plot, if given only one object, matplot assumes it's the y variable and uses the indices for x. However, x and y 
can be specified explicitly. 


matplot(x = seq(0, 10, length.out = 100), y = xmat, type='l') 

[Figure: the same lines plotted against the x axis "seq(0, 10, length.out = 100)"]


In fact, both x and y can be matrices. 


xes <- cbind(seq(0, 10, length.out = 100),
             seq(2.5, 12.5, length.out = 100),
             seq(5, 15, length.out = 100),
             seq(7.5, 17.5, length.out = 100))
matplot(x = xes, y = xmat, type = 'l')
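
Because matplot chooses colours and line types automatically, a legend is easiest to get right when both are set explicitly and passed to matplot and legend alike. This is a minimal sketch; the seed, the column labels and the names cols and ltys are illustrative assumptions, not part of the example above.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
xmat <- cbind(rnorm(100, -3), rnorm(100, -1), rnorm(100, 1), rnorm(100, 3))
colnames(xmat) <- c("mean -3", "mean -1", "mean 1", "mean 3")  # hypothetical labels

cols <- 1:4  # one colour per column
ltys <- 1:4  # one line type per column

matplot(xmat, type = "l", col = cols, lty = ltys, xlab = "Index", ylab = "Value")
legend("topleft", legend = colnames(xmat), col = cols, lty = ltys, bty = "n")
```

Sharing cols and ltys between the two calls keeps the legend in step with the lines if a column is later added or removed.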


Section 26.7: Empirical Cumulative Distribution Function 


A very useful and logical follow-up to histograms and density plots is the empirical cumulative distribution 
function (ECDF). The ecdf() function computes it, and plotting the result draws a step function: 


plot(ecdf(rnorm(100)), main="Cumulative distribution", xlab="x") 
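
The object returned by ecdf() is itself a function, so it can be evaluated at any point as well as plotted. The sketch below illustrates this; the seed and the names x and Fn are arbitrary choices.

```r
set.seed(1)        # arbitrary seed for reproducibility
x  <- rnorm(100)
Fn <- ecdf(x)      # Fn is a function

Fn(0)              # proportion of observations <= 0
Fn(c(-1, 1))       # the function is vectorised
quantile(Fn, 0.5)  # quantiles can be read off the ECDF

plot(Fn, main = "Cumulative distribution", xlab = "x")
abline(v = 0, lty = "dashed")
```

Evaluating Fn at a value gives the same result as mean(x <= value), which makes the ECDF a convenient way to look up empirical probabilities.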


Chapter 27: boxplot 


Parameters   Details (Source: R Documentation)
formula      a formula, such as y ~ grp, where y is a numeric vector of data values to be split into groups according to the grouping variable grp (usually a factor).
data         a data.frame (or list) from which the variables in formula should be taken.
subset       an optional vector specifying a subset of observations to be used for plotting.
na.action    a function which indicates what should happen when the data contain NAs. The default is to ignore missing values in either the response or the group.
boxwex       a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower.
plot         if TRUE (the default) then a boxplot is produced. If not, the summaries which the boxplots are based on are returned.
col          if col is non-null it is assumed to contain colors to be used to colour the bodies of the box plots. By default they are in the background colour.


Section 27.1: Create a box-and-whisker plot with boxplot() {graphics} 


This example uses the default boxplot() function and the iris data frame. 


> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa


Simple boxplot (Sepal.Length) 


Create a box-and-whisker graph of a numerical variable 


boxplot(iris[,1], xlab="Sepal.Length", ylab="Length (in centimeters)",
        main="Summary Characteristics of Sepal.Length (Iris Data)")


Boxplot of sepal length grouped by species 

Create a boxplot of a numerical variable grouped by a categorical variable 


boxplot(Sepal.Length~Species, data = iris) 


[Figure: boxplots of Sepal.Length for setosa, versicolor and virginica, in that order]


Bring order 


To change the order of the boxes in the plot, you have to change the order of the categorical variable's levels. 
For example, if we want to have the order virginica - versicolor - setosa: 


newSpeciesOrder <- factor(iris$Species, levels=c("virginica", "versicolor", "setosa"))
boxplot(Sepal.Length~newSpeciesOrder, data = iris)


[Figure: boxplots of Sepal.Length in the order virginica, versicolor, setosa]


Change groups names 


If you want to give your groups better names, you can use the names parameter. It takes a vector with one name 
per level of the categorical variable: 


boxplot(Sepal.Length~newSpeciesOrder, data = iris, names = c("name1", "name2", "name3")) 


Small improvements 
Color 


col: a vector of colors, one per level of the categorical variable 


boxplot(Sepal.Length~Species, data = iris,col=c("green", "yellow", "orange")) 


[Figure: boxplots of Sepal.Length for setosa, versicolor and virginica, filled green, yellow and orange]


Proximity of the box 

boxwex: a scale factor for the width of the boxes; smaller values give narrower boxes with more space between them. 

Left: boxplot(Sepal.Length~Species, data = iris, boxwex = 0.1) 
Right: boxplot(Sepal.Length~Species, data = iris, boxwex = 1) 

[Figure: two boxplot panels side by side, the left one with boxwex = 0.1 and the right one with boxwex = 1]


See the summaries on which the boxplots are based: plot=FALSE 


To see a summary, set the parameter plot to FALSE. 
Various results are given: 


> boxplot(Sepal.Length~newSpeciesOrder, data = iris, plot=FALSE) 
$stats #summary of the numerical variable for the 3 groups 
     [,1] [,2] [,3] 
[1,]  5.6  4.9  4.3 # lower whisker end 
[2,]  6.2  5.6  4.8 # first quartile limit 
[3,]  6.5  5.9  5.0 # median 
[4,]  6.9  6.3  5.2 # third quartile limit 
[5,]  7.9  7.0  5.8 # upper whisker end 

$n #number of observations in each group 
[1] 50 50 50 

$conf #lower and upper limits of the notches 
         [,1]     [,2]     [,3] 
[1,] 6.343588 5.743588 4.910622 
[2,] 6.656412 6.056412 5.089378 

$out #values lying beyond the whiskers (outliers) 
[1] 4.9 

$group #group to which each outlier belongs 
[1] 1 

$names #group names 
[1] "virginica"  "versicolor" "setosa" 
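
The list returned with plot = FALSE can be stored and queried like any other R object. The sketch below uses the default species order (setosa, versicolor, virginica) rather than the reordered factor above; the names b and medians are arbitrary.

```r
b <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)

# Row 3 of $stats holds the medians; label them with the group names
medians <- setNames(b$stats[3, ], b$names)
medians

# Pair each outlier with the group it belongs to
data.frame(group = b$names[b$group], value = b$out)
```

Here the single outlier (4.9) is reported for virginica, which is group 3 in the default order and group 1 in the reordered example above.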


Section 27.2: Additional boxplot style parameters 


Box 


• boxlty - box line type 
• boxlwd - box line width 
• boxcol - box line color 
• boxfill - box fill colors 


Median 


• medlty - median line type ("blank" for no line) 
• medlwd - median line width 
• medcol - median line color 
• medpch - median point (NA for no symbol) 
• medcex - median point size 
• medbg - median point background color 


Whisker 


• whisklty - whisker line type 
• whisklwd - whisker line width 
• whiskcol - whisker line color 


Staple 


• staplelty - staple line type 
• staplelwd - staple line width 
• staplecol - staple line color 


Outliers 


• outlty - outlier line type ("blank" for no line) 
• outlwd - outlier line width 
• outcol - outlier line color 
• outpch - outlier point type (NA for no symbol) 
• outcex - outlier point size 
• outbg - outlier point background color 


Example 


Default and heavily modified plots side by side 


par(mfrow=c(1,2))

# Default
boxplot(Sepal.Length ~ Species, data=iris)

# Modified
boxplot(Sepal.Length ~ Species, data=iris,
        boxlty=2, boxlwd=3, boxfill="cornflowerblue", boxcol="darkblue",
        medlty=2, medlwd=2, medcol="red", medpch=21, medcex=1, medbg="white",
        whisklty=2, whisklwd=3, whiskcol="darkblue",
        staplelty=2, staplelwd=2, staplecol="red",
        outlty=3, outlwd=3, outcol="grey", outpch=NA
)

[Figure: default (left) and modified (right) boxplots of Sepal.Length for setosa, versicolor and virginica; y axes from 4.5 to 8.0]


Chapter 28: ggplot2 


Section 28.1: Displaying multiple plots 


Display multiple plots in one image with the different facet functions. An advantage of this method is that, by 
default, all axes share the same scale across charts, making it easy to compare them at a glance. We'll use the mpg 
dataset included in ggplot2. 


Wrap charts line by line (attempts to create a square layout): 


library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~class)

[Figure: grid of scatter plots, one panel per class; x axis "displ" from 2 to 7]


Display multiple charts on one row, multiple columns: 


ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(.~class)


Display multiple charts on one column, multiple rows: 


ggplot(mpg, aes(x = displ, y = hwy)) + 
geom_point() + 
facet_grid(class~.) 


Display multiple charts in a grid by 2 variables: 


ggplot(mpg, aes(x = displ, y = hwy)) + 
geom_point() + 
facet_grid(trans~class) #"row" parameter, then "column" parameter 
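
The shared axes described at the start of this section are a default rather than a requirement. As a minimal sketch, the scales argument of facet_wrap() can free one or both axes; freeing only the y axis here is an arbitrary choice for illustration, and the name p is hypothetical.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~class, scales = "free_y")  # y axis fitted per panel; x axis still shared
p
```

Free scales make the pattern within each panel easier to see, at the cost of the at-a-glance comparison between panels that fixed scales provide.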


Section 28.2: Prepare your data for plotting 


ggplot2 works best with a long data frame. The following sample data represents the prices for sweets on 20 
different days, in a format described as wide, because each category has its own column. 


set.seed(47) 


sweetsWide <- data.frame(date      = 1:20,
                         chocolate = runif(20, min = 2, max = 4),
                         iceCream  = runif(20, min = 0.5, max = 1),
                         candy     = runif(20, min = 1, max = 3))

head(sweetsWide)
##   date chocolate  iceCream    candy
## 1    1  3.953924 0.5896727        …
## 2    2  2.747832 0.7783982 1.740851
## 3    3  3.523004 0.7578975 2.196754
## 4    4  3.644983 0.5667152 2.875028
## 5    5  3.147089 0.8446417 1.733543
## 6    6  3.382825 0.6900125 1.405674


To convert sweetsWide to long format for use with ggplot2, several useful functions from base R and from the 
packages reshape2, data.table and tidyr (in chronological order) can be used: 


# reshape from base R
sweetsLong <- reshape(sweetsWide, idvar = 'date', direction = 'long',
    varying = list(2:4), new.row.names = NULL, times = names(sweetsWide)[-1])


# melt from 'reshape2'
library(reshape2)
sweetsLong <- melt(sweetsWide, id.vars = 'date')


# melt from 'data.table'
# which is an optimized & extended version of 'melt' from 'reshape2'
library(data.table)
sweetsLong <- melt(setDT(sweetsWide), id.vars = 'date')


# gather from 'tidyr'
library(tidyr)
sweetsLong <- gather(sweetsWide, sweet, price, chocolate:candy)


They all give a similar result: 


head(sweetsLong)
##   date     sweet    price
## 1    1 chocolate 3.953924
## 2    2 chocolate 2.747832
## 3    3 chocolate 3.523004
## 4    4 chocolate 3.644983
## 5    5 chocolate 3.147089
## 6    6 chocolate 3.382825


See also Reshaping data between long and wide forms for details on converting data between long and wide format. 
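
In more recent versions of tidyr (1.0.0 and later), pivot_longer() is the recommended successor to gather(). The sketch below rebuilds the sample data so that it runs on its own, and assumes such a version of tidyr is installed.

```r
library(tidyr)

set.seed(47)
sweetsWide <- data.frame(date      = 1:20,
                         chocolate = runif(20, min = 2, max = 4),
                         iceCream  = runif(20, min = 0.5, max = 1),
                         candy     = runif(20, min = 1, max = 3))

sweetsLong <- pivot_longer(sweetsWide,
                           cols      = chocolate:candy,  # columns to stack
                           names_to  = "sweet",          # new column holding the old column names
                           values_to = "price")          # new column holding the values
head(sweetsLong)
```

The result has the same three columns as the gather() version, but it is a tibble and its rows are grouped by date rather than by sweet. The ggplot() call is unaffected by the row order.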


The resulting sweetsLong has one column of prices and one column describing the type of sweet. Now plotting is 
much simpler: 


library(ggplot2) 
ggplot(sweetsLong, aes(x = date, y = price, colour = sweet)) + geom_line() 


[Figure: line chart of price against date, with one coloured line per sweet (candy, chocolate, iceCream)]


Section 28.3: Add horizontal and vertical lines to plot 


Add one common horizontal line for all categorical variables 


# sample data
df <- data.frame(x = c('A', 'B'), y = c(3, 4))

# the + must end each line, otherwise R treats the statement as complete
p1 <- ggplot(df, aes(x=x, y=y)) +
  geom_bar(position = "dodge", stat = 'identity') +
  theme_bw()

p1 + geom_hline(aes(yintercept=5), colour="#990000", linetype="dashed")


Add one horizontal line for each categorical variable 


# sample data
df <- data.frame(x = c('A', 'B'), y = c(3, 4))

# add horizontal levels for drawing lines
df$hval <- df$y + 2

p1 <- ggplot(df, aes(x=x, y=y)) +
  geom_bar(position = "dodge", stat = 'identity') +
  theme_bw()

p1 + geom_errorbar(aes(y=hval, ymax=hval, ymin=hval), colour="#990000", width=0.75)


Add horizontal line over grouped bars 


# sample data
df <- data.frame(x = rep(c('A', 'B'), times=2),
                 group = rep(c('G1', 'G2'), each=2),
                 y = c(3, 4, 5, 6),
                 hval = c(5, 6, 7, 8))

p1 <- ggplot(df, aes(x=x, y=y, fill=group)) +
  geom_bar(position="dodge", stat="identity")

p1 + geom_errorbar(aes(y=hval, ymax=hval, ymin=hval),
                   colour="#990000",
                   position = "dodge",
                   linetype = "dashed")


Add vertical line 


# sample data
df <- data.frame(group=rep(c('A', 'B'), each=20),
                 x = rnorm(40, 5, 2),
                 y = rnorm(40, 10, 2))

p1 <- ggplot(df, aes(x=x, y=y, colour=group)) + geom_point()

p1 + geom_vline(aes(xintercept=5), color="#990000", linetype="dashed")


[Figure: scatter plot of y against x, coloured by group (A, B), with a dashed vertical line at x = 5]


Section 28.4: Scatter Plots 
We plot a simple scatter plot using the built-in iris data set as follows: 


library(ggplot2) 
ggplot(iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) + 
geom_point() 


This gives: 

[Figure: scatter plot of Petal.Length against Petal.Width, with points coloured by Species (setosa, versicolor, virginica)]


Section 28.5: Produce basic plots with qplot 


qplot is intended to be similar to the base R plot() function, trying to always plot your data without requiring too 
many specifications. 

basic qplot 


qplot(x = disp, y = mpg, data = mtcars) 


adding colors 


qplot(x = disp, y = mpg, colour = cyl,data = mtcars) 




adding a smoother 


qplot(x = disp, y = mpg, geom = c("point", "smooth"), data = mtcars) 
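
qplot() has been deprecated since ggplot2 3.4.0, so it may help to see how the three calls above can be written with ggplot(). The following is a sketch of equivalent plots rather than an exact reproduction of qplot's defaults; the object names are arbitrary.

```r
library(ggplot2)

# qplot(x = disp, y = mpg, data = mtcars)
p_basic <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point()

# qplot(x = disp, y = mpg, colour = cyl, data = mtcars)
p_colour <- ggplot(mtcars, aes(x = disp, y = mpg, colour = cyl)) +
  geom_point()

# qplot(x = disp, y = mpg, geom = c("point", "smooth"), data = mtcars)
p_smooth <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  geom_smooth()
```

Printing any of these objects (for example by typing p_smooth at the console) draws the plot. Wrapping cyl in factor() would give a discrete colour legend instead of a continuous one.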


Section 28.6: Vertical and Horizontal Bar Chart 


ggplot(data = diamonds, aes(x = cut, fill =color)) + 
geom_bar(stat = "count", position = "dodge") 


It is possible to obtain a horizontal bar chart simply by adding coord_flip() to the ggplot object: 

ggplot(data = diamonds, aes(x = cut, fill = color)) +
  geom_bar(stat = "count", position = "dodge") +
  coord_flip()


Section 28.7: Violin plot 


Violin plots are kernel density estimates mirrored in the vertical plane. They can be used to visualize several 
distributions side-by-side, with the mirroring helping to highlight any differences. 


ggplot(diamonds, aes(cut, price)) + 
geom_violin() 


Violin plots are named for their resemblance to the musical instrument; this is particularly visible when they are 
coupled with an overlaid boxplot. This visualisation then describes the underlying distributions both in terms of 
Tukey's 5 number summary (as boxplots) and full continuous density estimates (violins). 


ggplot(diamonds, aes(cut, price)) +
  geom_violin() +
  geom_boxplot(width = .1, fill = "black", outlier.shape = NA) +
  stat_summary(fun = "median", geom = "point", col = "white")  # fun replaces fun.y in ggplot2 3.3.0 and later


Chapter 29: Factors 


Section 29.1: Consolidating Factor Levels with a List 


There are times in which it is desirable to consolidate factor levels into fewer groups, perhaps because of sparse 
data in one of the categories. It may also occur when you have varying spellings or capitalization of the category 
names. Consider as an example the factor 


set.seed(1) 

colorful <- sample(c("red", "Red", "RED", "blue", "Blue", "BLUE", "green", "gren"),
                   size = 20,
                   replace = TRUE)

colorful <- factor(colorful) 


Since R is case-sensitive, a frequency table of this vector would appear as below. 


table(colorful)
colorful
 blue  Blue  BLUE green  gren   red   Red   RED
    3     1     4     2     4     1     3     2


This table, however, doesn't represent the true distribution of the data, and the categories may effectively be 
reduced to three types: Blue, Green, and Red. Three examples are provided. The first illustrates what seems like an 
obvious solution, but won't actually provide a solution. The second gives a working solution, but is verbose and 
computationally expensive. The third is not an obvious solution, but is relatively compact and computationally 
efficient. 


Consolidating levels using factor (factor_approach) 


factor(as.character(colorful),
       levels = c("blue", "Blue", "BLUE", "green", "gren", "red", "Red", "RED"),
       labels = c("Blue", "Blue", "Blue", "Green", "Green", "Red", "Red", "Red"))


[1] Green Blue Red Red Blue Red Red Red Blue Red Green Green Green Blue 
Red Green 
[17] Red Green Green Red 
Levels: Blue Blue Blue Green Green Red Red Red 
Warning message: 
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  : 
  duplicated levels in factors are deprecated 


Notice that there are duplicated levels. We still have three categories for "Blue", which doesn't complete our task of 
consolidating levels. Additionally, there is a warning that duplicated levels are deprecated, meaning that this code 
may generate an error in the future. (The output shown comes from an older version of R; since R 3.4.0, factor() 
accepts duplicated labels and merges them into a single level.) 


Consolidating levels using ifelse (ifelse_approach) 

factor(ifelse(colorful %in% c("blue", "Blue", "BLUE"),
              "Blue",
              ifelse(colorful %in% c("green", "gren"),
                     "Green",
                     "Red")))


[1] Green Blue Red Red Blue Red Red Red Blue Red Green Green Green Blue 
Red Green 
[17] Red Green Green Red 
Levels: Blue Green Red 


This code generates the desired result, but requires the use of nested ifelse statements. While there is nothing 
wrong with this approach, managing nested ifelse statements can be a tedious task and must be done carefully. 


Consolidating Factors Levels with a List (list_approach) 


A less obvious way of consolidating levels is to use a list where the name of each element is the desired category 
name, and the element is a character vector of the levels in the factor that should map to the desired category. This 
has the added advantage of working directly on the levels attribute of the factor, without having to assign new 
objects. 


levels(colorful) <-
  list("Blue"  = c("blue", "Blue", "BLUE"),
       "Green" = c("green", "gren"),
       "Red"   = c("red", "Red", "RED"))


[1] Green Blue Red Red Blue Red Red Red Blue Red Green Green Green Blue 
Red Green 
[17] Red Green Green Red 
Levels: Blue Green Red 


Benchmarking each approach 


The time required to execute each of these approaches is summarized below. (For the sake of space, the code to 
generate this summary is not shown) 


Unit: microseconds 
          expr     min      lq      mean   median      uq     max neval cld 
        factor  78.725  83.256  93.26023  87.5030  97.131 218.899   100   b 
        ifelse 104.494 107.609 123.53793 113.4145 128.281 254.580   100 
 list_approach  49.557  52.955  60.50756  54.9378  65.132 138.193   100   a 


The list approach runs about twice as fast as the ifelse approach. However, unless the data are very large, the 
differences in execution time will be on the order of microseconds or milliseconds. 
With such small time differences, efficiency need not guide the decision of which approach to use. Instead, use an 
approach that is familiar and comfortable, and which you and your collaborators will understand on future review. 
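
The benchmarking code is not shown in the text. The following is a rough sketch of how two of the approaches could be compared using only base R; the function names ifelse_approach and list_approach and the repetition count n_rep are hypothetical, and system.time() is used here in place of whatever benchmarking tool produced the table above.

```r
set.seed(1)
colorful <- factor(sample(c("red", "Red", "RED", "blue", "Blue", "BLUE", "green", "gren"),
                          size = 20, replace = TRUE))

ifelse_approach <- function(f) {
  factor(ifelse(f %in% c("blue", "Blue", "BLUE"), "Blue",
                ifelse(f %in% c("green", "gren"), "Green", "Red")))
}

list_approach <- function(f) {
  levels(f) <- list("Blue"  = c("blue", "Blue", "BLUE"),
                    "Green" = c("green", "gren"),
                    "Red"   = c("red", "Red", "RED"))
  f
}

# Both approaches should agree before their speed is compared
stopifnot(identical(as.character(ifelse_approach(colorful)),
                    as.character(list_approach(colorful))))

n_rep <- 1000  # hypothetical repetition count; raise it for steadier timings
system.time(for (i in seq_len(n_rep)) ifelse_approach(colorful))
system.time(for (i in seq_len(n_rep)) list_approach(colorful))
```

Timings depend on the machine and the R version, so the absolute numbers will differ from the table; only the relative ordering is of interest.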


Section 29.2: Basic creation of factors 


Factors are one way to represent categorical variables in R. A factor is stored internally as a vector of integers. The 
unique elements of the supplied character vector are known as the levels of the factor. By default, if the levels are 
not supplied by the user, then R will generate the set of unique values in the vector, sort these values 
alphanumerically, and use them as the levels. 


charvar <- rep(c("n", "c"), each = 3)
f <- factor(charvar)
f
levels(f)

> f
[1] n n n c c c
Levels: c n
> levels(f)
[1] "c" "n"


If you want to change the ordering of the levels, then one option is to specify the levels manually: 


levels(factor(charvar, levels = c("n","c"))) 


> levels(factor(charvar, levels = c("n","c")))
[1] "n" "c"


Factors have a number of properties. For example, levels can be given labels: 


> f <- factor(charvar, levels=c("n", "c"), labels=c("Newt", "Capybara")) 
> f 
[1] Newt Newt Newt Capybara Capybara Capybara 


Levels: Newt Capybara 


Another property that can be assigned is whether the factor is ordered: 


> Weekdays <- factor(c("Monday", "Wednesday", "Thursday", "Tuesday", "Friday", "Sunday", 
"Saturday")) 

> Weekdays 

[1] Monday Wednesday Thursday Tuesday Friday Sunday Saturday 

Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday 

> Weekdays <- factor(Weekdays, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", 
"Saturday", "Sunday"), ordered=TRUE) 

> Weekdays 

[1] Monday Wednesday Thursday Tuesday Friday Sunday Saturday 

Levels: Monday < Tuesday < Wednesday < Thursday < Friday < Saturday < Sunday 


When a level of the factor is no longer used, you can drop it using the droplevels() function: 


> Weekend <- subset(Weekdays, Weekdays == "Saturday" | Weekdays == "Sunday") 
> Weekend 

[1] Sunday Saturday 

Levels: Monday < Tuesday < Wednesday < Thursday < Friday < Saturday < Sunday 
> Weekend <- droplevels(Weekend) 

> Weekend 

[1] Sunday Saturday 

Levels: Saturday < Sunday 
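
The statement at the start of this section that a factor is stored internally as a vector of integers can be checked directly. The sketch below reuses the charvar example from above.

```r
charvar <- rep(c("n", "c"), each = 3)
f <- factor(charvar)

as.integer(f)             # integer codes: positions in levels(f)
levels(f)[as.integer(f)]  # codes mapped back to labels by hand
as.character(f)           # the same mapping, done by R
typeof(f)                 # "integer"
```

Since the levels are c and n, the codes are 2 for "n" and 1 for "c". This is why as.numeric() on a factor returns codes rather than the original values; to recover numbers stored as factor labels, convert with as.numeric(as.character(f)).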


Section 29.3: Changing and reordering factors 


When factors are created with defaults, levels are formed by as.character applied to the inputs and are ordered 
alphabetically. 


charvar <- rep(c("W", "n", "c"), times=c(17, 20, 14))
f <- factor(charvar)
levels(f)
# [1] "c" "n" "W"


In some situations the default ordering of levels (alphabetic/lexical order) will be acceptable. For 
example, if one just wants to plot the frequencies, this will be the result: 


plot(f, col=1:length(levels(f)))


But if we want a different ordering of levels, we need to specify this in the levels or labels parameter (taking 
care that the meaning of "order" here is different from ordered factors, see below). There are many alternatives to 
accomplish that task depending on the situation. 


1. Redefine the factor 


When it is possible, we can recreate the factor using the levels parameter with the order we want. 


ff <- factor(charvar, levels = c("n", "W", "c"))
levels(ff)
# [1] "n" "W" "c"

gg <- factor(charvar, levels = c("W", "c", "n"))
levels(gg)
# [1] "W" "c" "n"


When the input levels are different than the desired output levels, we use the labels parameter which causes the 
levels parameter to become a "filter" for acceptable input values, but leaves the final values of "levels" for the 
factor vector as the argument to labels: 


fm <- factor(as.numeric(f), levels = c(2,3,1),
             labels = c("nn", "WW", "cc"))
levels(fm)
# [1] "nn" "WW" "cc"


fm <- factor(LETTERS[1:6], levels = LETTERS[1:4],  # only 'A'-'D' as input
             labels = letters[1:4])                # but assigned to 'a'-'d'
fm
# [1] a    b    c    d    <NA> <NA>
# Levels: a b c d


2. Use relevel function 


When there is one specific level that needs to be the first we can use relevel. This happens, for example, in the 
context of statistical analysis, when a base category is necessary for hypothesis testing. 


g <- relevel(f, "n") # moves n to be the first level
levels(g)
# [1] "n" "c" "W"


As can be verified, f and g hold the same values; only the order of their levels differs: 

all.equal(f, g)
# [1] "Attributes: < Component “levels”: 2 string mismatches >"

all.equal(f, g, check.attributes = FALSE)
# [1] TRUE


3. Reordering factors 


There are cases when we need to reorder the levels based on a number, a partial result, a computed statistic, or 
previous calculations. Let's reorder based on the frequencies of the levels 


table(g)
# g
#  n  c  W
# 20 14 17


The reorder function is generic (see help(reorder)), but in this context it needs: x, in this case the factor; X, a 
numeric vector of the same length as x; and FUN, a function to be applied to X and computed by level of x, which 
determines the order of the levels, by default increasing. The result is the same factor with its levels reordered. 


g.ord <- reorder(g, rep(1, length(g)), FUN=sum) # increasing
levels(g.ord)
# [1] "c" "W" "n"


To get the decreasing order we use negative values (-1): 


g.ord.d <- reorder(g, rep(-1, length(g)), FUN=sum)
levels(g.ord.d)
# [1] "n" "W" "c"

Again the factor is the same as the others. 


data.frame(f, g, g.ord, g.ord.d)[seq(1, length(g), by=5),] # just some lines

#    f g g.ord g.ord.d
# 1  W W     W       W
# 6  W W     W       W
# 11 W W     W       W
# 16 W W     W       W
# 21 n n     n       n
# 26 n n     n       n
# 31 n n     n       n
# 36 n n     n       n
# 41 c c     c       c
# 46 c c     c       c
# 51 c c     c       c
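
The frequency-based orderings above can also be obtained without reorder, by sorting the frequency table and passing its names as the levels. This is a sketch of an alternative, not the method used in the text; the names by_count_inc and by_count_dec are arbitrary.

```r
charvar <- rep(c("W", "n", "c"), times = c(17, 20, 14))
g <- factor(charvar)

# Levels sorted by how often each one occurs
by_count_inc <- factor(g, levels = names(sort(table(g))))                     # c (14), W (17), n (20)
by_count_dec <- factor(g, levels = names(sort(table(g), decreasing = TRUE)))  # n, W, c

levels(by_count_inc)
levels(by_count_dec)
```

As with reorder, only the order of the levels changes; the values stored in the factor are the same.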


When there is a quantitative variable related to the factor variable, we could use other functions to reorder the 
levels. Let's take the iris data (help("iris") for more information) and reorder the Species factor by its 
mean Sepal.Width. 


miris <- iris # copy the data (see help("iris"))
with(miris, tapply(Sepal.Width, Species, mean))
#     setosa versicolor  virginica
#      3.428      2.770      2.974


miris$Species.o <- with(miris, reorder(Species, -Sepal.Width))
levels(miris$Species.o)
# [1] "setosa"     "virginica"  "versicolor"


The usual boxplot (say: with(miris, boxplot(Petal.Width~Species))) will show the species in this order: setosa, 
versicolor, and virginica. But using the reordered factor we get the species ordered by mean Sepal.Width: 


boxplot(Petal.Width~Species.o, data = miris,
        xlab = "Species", ylab = "Petal Width",
        main = "Iris Data, ordered by mean sepal width", varwidth = TRUE,
        col = 2:4)

[Figure: boxplot titled "Iris Data, ordered by mean sepal width"]


Additionally, it is also possible to change the names of levels, combine them into groups, or add new levels. For 
that we use the function of the same name levels. 


f1 <- f
levels(f1)
# [1] "c" "n" "W"

levels(f1) <- c("upper", "upper", "CAP") # rename and group
levels(f1)
# [1] "upper" "CAP"


f2 <- f1
levels(f2) <- c("upper", "CAP", "Number") # add Number level, which is empty
levels(f2)
# [1] "upper"  "CAP"    "Number"

f2[length(f2):(length(f2)+5)] <- "Number" # add cases for the new level
table(f2)
# f2
#  upper    CAP Number
#     33     17      6

f3 <- f1
levels(f3) <- list(G1 = "upper", G2 = "CAP", G3 = "Number") # the same using list
levels(f3)
# [1] "G1" "G2" "G3"

f3[length(f3):(length(f3)+6)] <- "G3" # add cases for the new level
table(f3)
# f3
# G1 G2 G3
# 33 17  7


- Ordered factors 


Finally, we know that ordered factors are different from factors: the former are used to represent ordinal data, 
and the latter to work with nominal data. It usually does not make sense to change the order of levels for 
ordered factors, but we can change their labels. 


ordvar<-rep(c("Low", "Medium", "High"), times=c(7,2,4)) 


of<-ordered(ordvar, levels=c("Low", "Medium", "High")) 
levels(of) 

# [1] "Low" "Medium" "High"

of1<-of 


levels(of1)<- c("LOW", "MEDIUM", "HIGH") 

levels(of1) 

# [1] "LOW" "MEDIUM" "HIGH" 

is.ordered(of1) 

# [1] TRUE 

of1

# [1] LOW LOW LOW LOW LOW LOW LOW MEDIUM MEDIUM HIGH HIGH HIGH HIGH 
# Levels: LOW < MEDIUM < HIGH 
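
Because the levels of an ordered factor carry a ranking, comparison operators work on them directly. A minimal sketch, using a hypothetical vector sizes that is not part of the example above:

```r
# 'sizes' is a hypothetical example vector, not taken from the text above
sizes <- ordered(c("Low", "High", "Medium", "Low"),
                 levels = c("Low", "Medium", "High"))

below_high <- sizes < "High"   # compares by level rank, not alphabetically
below_high
# [1]  TRUE FALSE  TRUE  TRUE

largest <- max(sizes)          # max() is defined for ordered factors
as.character(largest)
# [1] "High"
```

With a plain (unordered) factor, the same comparison returns NA with a warning, which is one practical reason to declare ordinal data as ordered.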


Section 29.4: Rebuilding factors from zero 


Problem 


Factors are used to represent variables that take values from a set of categories, known as Levels in R. For example, 
some experiment could be characterized by the energy level of a battery, with four levels: empty, low, normal, and 
full. Then, for 5 different sampling sites, those levels could be identified, in those terms, as follows: 


full, full, normal, empty, low 


Typically, in databases or other information sources, these data are handled by arbitrary integer indices
associated with the categories or levels. Suppose that, for the given example, we assign the indices as
follows: 1 = empty, 2 = low, 3 = normal, 4 = full. Then the 5 samples could be coded as:


4,4,3,1,2 


It could happen that, from your source of information, e.g. a database, you only have the encoded list of integers, 
and the catalog associating each integer with each level-keyword. How can a factor of R be reconstructed from that 
information? 


Solution


We will simulate a vector of 20 integers that represents the samples, each of which may have one of four different
values:


set.seed(18)
ii <- sample(1:4, 20, replace = TRUE)
ii


[1] 4 3 4 1 1 3 2 3 2 1 3 4 1 2 4 1 3 1 4 1


The first step is to make a factor, from the previous sequence, in which the levels or categories are exactly the 
numbers from 1 to 4. 


fii <- factor(ii, levels = 1:4) # it is necessary to indicate the numeric levels


fii


[1] 4 3 4 1 1 3 2 3 2 1 3 4 1 2 4 1 3 1 4 1
Levels: 1 2 3 4


Now simply relabel the levels of the factor already created with the catalog keywords:


levels(fii) <- c("empty", "low", "normal", "full")


fii 


[1] full normal full empty empty normal low normal low empty 
[11] normal full empty low full empty normal empty full empty 
Levels: empty low normal full 
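
The two steps can also be collapsed into a single call, since factor() accepts both levels (the stored codes) and labels (the keywords). A minimal sketch, where catalog and codes are hypothetical names for the two pieces of information described in the problem:

```r
# 'catalog' and 'codes' are hypothetical names, assumed for illustration
catalog <- c("empty", "low", "normal", "full")  # position i holds the keyword for code i
codes <- c(4, 4, 3, 1, 2)                        # the five samples from the problem statement

# levels = the integer codes, labels = the keywords attached to them
battery <- factor(codes, levels = seq_along(catalog), labels = catalog)

as.character(battery)
# [1] "full"   "full"   "normal" "empty"  "low"

as.integer(battery)   # the integer codes remain recoverable
# [1] 4 4 3 1 2
```

This form assumes the catalog is complete and ordered by code; any code missing from levels becomes NA.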


Chapter 30: Pattern Matching and Replacement


This topic covers matching string patterns, as well as extracting or replacing them. For details on defining 
complicated patterns see Regular Expressions. 


Section 30.1: Finding Matches 


# example data 
test_sentences <- c("The quick brown fox", "jumps over the lazy dog") 


Is there a match? 


grepl() is used to check whether a word or regular expression exists in a string or character vector. The function
returns a TRUE/FALSE (or "Boolean") vector.


Notice that we can check each string for the word "fox" and receive a Boolean vector in return. 


grepl("fox", test_sentences) 
#[1] TRUE FALSE 


Match locations 


grep takes a regular expression and a character vector, and returns a numeric vector of the indices of the
matching elements. Here it tells us which sentence contains the word "fox":


grep("fox", test_sentences)
#[1] 1


Matched values 


To select sentences that match a pattern: 


# each of the following lines does the job: 
test_sentences[grep("fox", test_sentences)]
test_sentences[grepl("fox", test_sentences)]
grep("fox", test_sentences, value = TRUE) 

# [1] "The quick brown fox" 


Details 


Since the "fox" pattern is just a word, rather than a regular expression, we could improve performance (with either
grep or grepl) by specifying fixed = TRUE.


grep("fox", test_sentences, fixed = TRUE)
#[1] 1


To select sentences that don't match a pattern, one can use grep with invert = TRUE; or follow subsetting rules 
with -grep(...) or !grepl(...). 


In both grepl(pattern, x) and grep(pattern, x), the x parameter is vectorized, the pattern parameter is not. As 
a result, you cannot use these directly to match pattern[1] against x[1], pattern[2] against x[2], and so on. 
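
A minimal sketch of both points, assuming a hypothetical vector patterns holding one pattern per sentence: mapply() can pair patterns with strings element by element, and invert = TRUE selects the non-matching elements.

```r
test_sentences <- c("The quick brown fox", "jumps over the lazy dog")
patterns <- c("fox", "cat")   # hypothetical: one pattern per sentence

# mapply() calls grepl(patterns[i], test_sentences[i]) for each i
pairwise <- mapply(grepl, patterns, test_sentences, USE.NAMES = FALSE)
pairwise
# [1]  TRUE FALSE

# sentences that do NOT contain "fox"
non_matching <- grep("fox", test_sentences, invert = TRUE, value = TRUE)
non_matching
# [1] "jumps over the lazy dog"
```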


Summary of matches 


After running, for example, the grepl command, you may want an overview of how many matches were
TRUE or FALSE. This is useful with big data sets, for instance. To get it, run the summary command:


# example data
test_sentences <- c("The quick brown fox", "jumps over the lazy dog") 


# find matches 
matches <- grepl("fox", test_sentences) 


# overview 
summary(matches)


Section 30.2: Single and Global match 


When working with regular expressions, one modifier for PCRE is g, for global match.

In R, matching and replacement functions come in two versions, first match and global match:

• sub(pattern, replacement, text) replaces the first occurrence of pattern with replacement in text
• gsub(pattern, replacement, text) does the same as sub, but for every occurrence of pattern
• regexpr(pattern, text) returns the position of the first match of pattern
• gregexpr(pattern, text) returns the positions of all matches


Some random data: 


set.seed(123) 
teststring <- paste0(sample(letters,20),collapse="")


# teststring
#[1] "htjuwakqxzpgrsbncvyo"


Let's see how this works if we want to replace vowels by something else: 


sub("[aeiouy]"," ** HERE WAS A VOWEL** ",teststring) 
#[1] "htj ** HERE WAS A VOWEL** wakqxzpgrsbncvyo" 


gsub("[aeiouy]"," ** HERE WAS A VOWEL** ",teststring)
#[1] "htj ** HERE WAS A VOWEL** w ** HERE WAS A VOWEL** kqxzpgrsbncv ** HERE WAS A VOWEL**  ** HERE WAS A VOWEL** "


Now let's see how we can find a consonant immediately followed by one or more vowels:


regexpr("[^aeiou][aeiou]+", teststring)
#[1] 3
#attr(,"match.length")
#[1] 2
#attr(,"useBytes")
#[1] TRUE


We have a match at position 3 of the string, of length 2, i.e. ju.

Now, if we want to get all matches:


gregexpr("[^aeiou][aeiou]+", teststring)
#[[1]]
#[1]  3  5 19
#attr(,"match.length")
#[1] 2 2 2
#attr(,"useBytes")
#[1] TRUE


All this is really great, but it only gives us the positions of the matches, and it is not so easy to see what was
matched. This is where regmatches comes in: its sole purpose is to extract the matched strings from the result of
regexpr or gregexpr, though its syntax is different.


Let's save our matches in a variable and then extract them from original string: 


matches <- gregexpr("[^aeiou][aeiou]+", teststring)
regmatches(teststring,matches)


#[[1]]
#[1] "ju" "wa" "yo"


It may seem strange that there is no shortcut, but this design allows extraction from another string using the
matches of the first one (think of comparing two long vectors where you know there is a common pattern for the
first but not for the second; this allows an easy comparison):


teststring2 <- "this is another string to match against" 
regmatches(teststring2,matches)
#[[1]]
#[1] "is" " i" "ri"


Note: by default the pattern is not a Perl Compatible Regular Expression, so some features such as lookarounds are
not supported, but each function presented here accepts a perl = TRUE argument to enable them.
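
As a minimal sketch of what perl = TRUE enables, the following uses a lookahead to extract a number only when it is followed by a unit. The string x is a hypothetical example, not data from this chapter:

```r
x <- "price: 100 USD, weight: 20 kg"   # hypothetical example string

# (?= USD) is a lookahead: match digits only when " USD" follows,
# without including " USD" in the match itself
m <- gregexpr("[0-9]+(?= USD)", x, perl = TRUE)
usd <- regmatches(x, m)[[1]]
usd
# [1] "100"
```

Without perl = TRUE, the same pattern is rejected as an invalid regular expression.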


Section 30.3: Making substitutions 


# example data 
test_sentences <- c("The quick brown fox quickly", "jumps over the lazy dog") 


Let's make the brown fox red: 


sub("brown", "red", test_sentences) 
#[1] "The quick red fox quickly" "jumps over the lazy dog" 


Now, let's make the "fast" fox act "fastly". This won't do it:


sub("quick", "fast", test_sentences)
#[1] "The fast brown fox quickly" "jumps over the lazy dog"


sub only makes the first available replacement; we need gsub for global replacement:


gsub("quick", "fast", test_sentences)
#[1] "The fast brown fox fastly" "jumps over the lazy dog"


See Modifying strings by substitution for more examples. 
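
Replacements can also reuse parts of the match through backreferences. A minimal sketch, using a hypothetical variable sentence:

```r
sentence <- "The quick brown fox"   # hypothetical example string

# \\1 and \\2 refer back to the first and second parenthesised group
swapped <- sub("(quick) (brown)", "\\2 \\1", sentence)
swapped
# [1] "The brown quick fox"

# gsub applies the same idea to every match: wrap each vowel in brackets
bracketed <- gsub("([aeiou])", "[\\1]", "fox dog")
bracketed
# [1] "f[o]x d[o]g"
```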


Section 30.4: Find matches in big data sets 


With big data sets, a plain call such as grepl("fox", test_sentences) does not perform well. Examples of big data
sets are crawled websites or millions of tweets.


A first speed-up is the perl = TRUE option. Even faster is the option fixed = TRUE, which applies when the pattern
is a literal string. A complete example would be:


# example data
test_sentences <- c("The quick brown fox", "jumps over the lazy dog") 


grepl("fox", test_sentences, perl = TRUE) 
#[1] TRUE FALSE 
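
A rough way to compare the three variants is to time them on a larger input. This is only a sketch: the vector big is a hypothetical stand-in for a real data set, and the timings depend on the machine and the R version, so no output is shown.

```r
# hypothetical "big" input: the two example sentences repeated 50,000 times
big <- rep(c("The quick brown fox", "jumps over the lazy dog"), 5e4)

t_default <- system.time(r_default <- grepl("fox", big))
t_perl    <- system.time(r_perl    <- grepl("fox", big, perl = TRUE))
t_fixed   <- system.time(r_fixed   <- grepl("fox", big, fixed = TRUE))

# all three variants must agree on the result
stopifnot(identical(r_default, r_perl), identical(r_default, r_fixed))

# elapsed seconds for each variant
timings <- c(default = t_default[["elapsed"]],
             perl    = t_perl[["elapsed"]],
             fixed   = t_fixed[["elapsed"]])
timings
```

For serious comparisons a benchmarking package with repeated runs would be more reliable than a single system.time call.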


In text mining, a corpus is often used. A corpus cannot be used directly with grepl. Therefore, consider
this function:


library(tm)  # provides tm_index()

searchCorpus <- function(corpus, pattern) {
  return(tm_index(corpus, FUN = function(x) {
    grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
  }))
}


Chapter 31: Run-length encoding


Section 31.1: Run-length Encoding with rle


Run-length encoding captures the lengths of runs of consecutive elements in a vector. Consider an example vector:


dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)


The rle function extracts each run and its length:


r <- rle(dat)
r
# Run Length Encoding
#   lengths: int [1:6] 1 3 1 1 2 2
#   values : num [1:6] 1 2 3 1 4 1


The values for each run are captured in r$values:


r$values
# [1] 1 2 3 1 4 1


This captures that we first saw a run of 1's, then a run of 2's, then a run of 3's, then a run of 1's, and so on.

The lengths of each run are captured in r$lengths:


r$lengths
# [1] 1 3 1 1 2 2


We see that the initial run of 1's was of length 1, the run of 2's that followed was of length 3, and so on. 
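
A common use of the two components together is locating the longest run. The following is a minimal sketch on a small example vector; the names longest and run_starts are introduced here for illustration:

```r
dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)   # small example vector
r <- rle(dat)

longest <- which.max(r$lengths)                   # index of the longest run
run_starts <- cumsum(r$lengths) - r$lengths + 1   # first position of every run

r$values[longest]     # value of the longest run
# [1] 2
r$lengths[longest]    # its length
# [1] 3
run_starts[longest]   # where it starts in dat
# [1] 2
```

which.max returns the first maximum, so ties are resolved in favour of the earliest run.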


Section 31.2: Identifying and grouping by runs in base R 


One might want to group their data by the runs of a variable and perform some sort of analysis. Consider the 
following simple dataset: 


(dat <- data.frame(x = c(1, 1, 2, 2, 2, 1), y = 1:6))
#   x y
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
# 5 2 5
# 6 1 6


The variable x has three runs: a run of length 2 with value 1, a run of length 3 with value 2, and a run of length 1 
with value 1. We might want to compute the mean value of variable y in each of the runs of variable x (these mean 
values are 1.5, 4, and 6). 


In base R, we would first compute the run-length encoding of the x variable using rle: 


(r <- rle(dat$x))
# Run Length Encoding
#   lengths: int [1:3] 2 3 1
#   values : num [1:3] 1 2 1


The next step is to compute the run number of each row of our dataset. We know that the total number of runs is
length(r$lengths), and the length of each run is r$lengths, so we can compute the run number of each of our
rows with rep:


(run.id <- rep(seq_along(r$lengths), r$lengths))
# [1] 1 1 2 2 2 3


Now we can use tapply to compute the mean y value for each run by grouping on the run id: 


data.frame(x=r$values, meanY=tapply(dat$y, run.id, mean))
#   x meanY
# 1 1   1.5
# 2 2   4.0
# 3 1   6.0


Section 31.3: Run-length encoding to compress and decompress vectors


Long vectors with long runs of the same value can be significantly compressed by storing them in their run-length 
encoding (the value of each run and the number of times that value is repeated). As an example, consider a vector 
of length 10 million with a huge number of 1's and only a small number of 0's: 


set.seed(144)
dat <- sample(rep(0:1, c(1, 1e5)), 1e7, replace=TRUE)


table(dat)
#       0       1
#     103 9999897


Storing 10 million entries will require significant space, but we can instead create a data frame with the run-length
encoding of this vector:


rle.df <- with(rle(dat), data.frame(values, lengths))
dim(rle.df)
# [1] 207   2

head(rle.df)
#   values lengths
# 1      1   52818
# 2      0       1
# 3      1  219329
# 4      0       1
# 5      1  318306
# 6      0       1


From the run-length encoding, we see that the first 52,818 values in the vector are 1's, followed by a single 0, 
followed by 219,329 consecutive 1's, followed by a 0, and so on. The run-length encoding only has 207 entries, 
requiring us to store only 414 values instead of 10 million values. As rle.df is a data frame, it can be stored using 
standard functions like write.csv. 
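
The saving can be checked directly with object.size. The sketch below uses a smaller vector (1 million entries, an assumption made only to keep it quick) and compares the memory footprint of the raw vector with that of its run-length encoding; exact byte counts vary by platform, so none are shown.

```r
set.seed(144)
# smaller than the 10-million example, to keep the sketch quick
dat <- sample(rep(0:1, c(1, 1e5)), 1e6, replace = TRUE)
r <- rle(dat)

size_raw <- as.numeric(object.size(dat))  # bytes used by the full vector
size_rle <- as.numeric(object.size(r))    # bytes used by the encoding

size_rle < size_raw                       # the encoding is smaller here
lossless <- identical(inverse.rle(r), dat)  # and nothing is lost
lossless
```

The gain depends entirely on the data: a vector with few repeated neighbours can take more space encoded than raw.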


Decompressing a vector in run-length encoding can be accomplished in two ways. The first method is to simply call 
rep, passing the values element of the run-length encoding as the first argument and the lengths element of the 
run-length encoding as the second argument: 


decompressed <- rep(rle.df$values, rle.df$lengths)


We can confirm that our decompressed data is identical to our original data:


identical(decompressed, dat)
# [1] TRUE 


The second method is to use R's built-in inverse.rle function on the rle object, for instance:


rle.obj <- rle(dat) # create a rle object here
class(rle.obj)
# [1] "rle"

dat.inv <- inverse.rle(rle.obj) # apply the inverse.rle on the rle object 


We can confirm again that this produces exactly the original dat: 


identical(dat.inv, dat) 
# [1] TRUE 


Section 31.4: Identifying and grouping by runs in data.table


The data.table package provides a convenient way to group by runs in data. Consider the following example data:


library(data.table)


(DT <- data.table(x = c(1, 1, 2, 2, 2, 1), y = 1:6))
#    x y
# 1: 1 1
# 2: 1 2
# 3: 2 3
# 4: 2 4
# 5: 2 5
# 6: 1 6


The variable x has three runs: a run of length 2 with value 1, a run of length 3 with value 2, and a run of length 1 
with value 1. We might want to compute the mean value of variable y in each of the runs of variable x (these mean 
values are 1.5, 4, and 6). 


The data.table rleid function provides an id indicating the run id of each element of a vector: 


rleid(DT$x)
# [1] 1 1 2 2 2 3


One can then easily group on this run ID and summarize the y data: 


DT[, mean(y), by = .(x, rleid(x))]
#    x rleid  V1
# 1: 1     1 1.5
# 2: 2     2 4.0
# 3: 1     3 6.0


Chapter 32: Speeding up tough-to-vectorize code


Section 32.1: Speeding tough-to-vectorize for loops with Rcpp


Consider the following tough-to-vectorize for loop, which creates a vector of length len where the first element is 
specified (first) and each element x_i is equal to cos(x_{i-1} + 1): 


repeatedCosPlusOne <- function(first, len) {
  x <- numeric(len)
  x[1] <- first
  for (i in 2:len) {
    x[i] <- cos(x[i-1] + 1)
  }
  return(x)
}

This code involves a for loop with a fast operation (cos(x[i-1]+1)), a pattern that often benefits from vectorization.
However, it is not trivial to vectorize this operation with base R, since R does not have a "cumulative cosine of x+1"
function.
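
One detail of the loop as written: 2:len counts down (2, 1) when len is 1, so the function misbehaves when a length-one result is requested. A minimal sketch of a guarded variant, under the hypothetical name repeatedCosPlusOneSafe:

```r
# Hypothetical guarded variant; the name is an assumption for illustration
repeatedCosPlusOneSafe <- function(first, len) {
  x <- numeric(len)
  if (len == 0) return(x)          # nothing to fill
  x[1] <- first
  for (i in seq_len(len)[-1]) {    # empty sequence when len == 1
    x[i] <- cos(x[i - 1] + 1)
  }
  x
}

one <- repeatedCosPlusOneSafe(1, 1)     # just the first element
three <- repeatedCosPlusOneSafe(1, 3)   # 1, cos(2), cos(cos(2) + 1)
```

For len of 2 or more the guarded loop visits the same indices as 2:len, so results are unchanged.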


One possible approach to speeding up this function would be to implement it in C++, using the Rcpp package:


library(Rcpp)
cppFunction("NumericVector repeatedCosPlusOneRcpp(double first, int len) {
  NumericVector x(len);
  x[0] = first;
  for (int i=1; i < len; ++i) {
    x[i] = cos(x[i-1]+1);
  }
  return x;
}")


This often provides significant speedups for large computations while yielding the exact same results:


all.equal(repeatedCosPlusOne(1, 1e6), repeatedCosPlusOneRcpp(1, 1e6) ) 
# [1] TRUE 

system.time(repeatedCosPlusOne(1, 1e6))
#    user  system elapsed
#   1.274   0.015    1.31

system.time(repeatedCosPlusOneRcpp(1, 1e6))
#    user  system elapsed
#   0.028   0.001   0.030


In this case, the Rcpp code generates a vector of length 1 million in 0.03 seconds instead of 1.31 seconds with the 
base R approach. 


Section 32.2: Speeding tough-to-vectorize for loops by byte compiling


Following the Rcpp example in this documentation entry, consider the following tough-to-vectorize function, which
creates a vector of length len where the first element is specified (first) and each element x_i is equal to
cos(x_{i-1} + 1):


repeatedCosPlusOne <- function(first, len) {
  x <- numeric(len)
  x[1] <- first
  for (i in 2:len) {
    x[i] <- cos(x[i-1] + 1)
  }
  return(x)
}

One simple approach to speeding up such a function without rewriting a single line of code is byte compiling the
code using R's compiler package:


library(compiler) 
repeatedCosPlusOneCompiled <- cmpfun(repeatedCosPlusOne) 


The resulting function will often be significantly faster while still returning the same results: 


all.equal(repeatedCosPlusOne(1, 1e6), repeatedCosPlusOneCompiled(1, 1e6)) 
# [1] TRUE 

system.time(repeatedCosPlusOne(1, 1e6))
#    user  system elapsed
#       ?   0.014    1.20

system.time(repeatedCosPlusOneCompiled(1, 1e6))
#    user  system elapsed
#   0.339   0.002   0.341


In this case, byte compiling sped up the tough-to-vectorize operation on a vector of length 1 million from 1.20
seconds to 0.34 seconds.


Remark 


The essence of repeatedCosPlusOne, as the cumulative application of a single function, can be expressed more
transparently with Reduce:


iterFunc <- function(init, n, func) {
  funcs <- replicate(n, func)
  Reduce(function(., f) f(.), funcs, init = init, accumulate = TRUE)
}
repeatedCosPlusOne_vec <- function(first, len) {
  iterFunc(first, len - 1, function(.) cos(. + 1))
}


repeatedCosPlusOne_vec may be regarded as a "vectorization" of repeatedCosPlusOne. However, it can be
expected to be slower by a factor of 2:


library(microbenchmark) 

microbenchmark(
  repeatedCosPlusOne(1, 1e4),
  repeatedCosPlusOne_vec(1, 1e4)
)
#> Unit: milliseconds
#>                              expr       min        lq     mean   median       uq      max neval cld
#>      repeatedCosPlusOne(1, 10000)  8.349261  9.216724 10.22715 10.23095 11.10817 14.33763   100  a
#>  repeatedCosPlusOne_vec(1, 10000) 14.406291 16.236153 17.55571 17.22295 18.59085 24.37059   100   b


Chapter 33: Introduction to Geographical Maps


See also I/O for geographic data 


Section 33.1: Basic map-making with map() from the package maps


The function map() from the package maps provides a simple starting point for creating maps with R. 
A basic world map can be drawn as follows: 


require(maps) 
map() 


The color of the outline can be changed by setting the color parameter, col, to either the character name or hex 
value of a color: 


require(maps) 
map(col = "cornflowerblue")


To fill land masses with the color in col we can set fill = TRUE:


require(maps)
map(fill = TRUE, col = c("cornflowerblue"))


A vector of any length may be supplied to col when fill = TRUE is also set:


require(maps)
map(fill = TRUE, col = c("cornflowerblue", "limegreen", "hotpink"))


In the example above colors from col are assigned arbitrarily to polygons in the map representing regions and 
colors are recycled if there are fewer colors than polygons. 


We can also use color coding to represent a statistical variable, which may optionally be described in a legend. A 
map created as such is known as a "choropleth". 


The following choropleth example sets the first argument of map(), database, to "county" and then to "state" in
order to color-code unemployment using data from the built-in datasets unemp and county.fips, while overlaying
state lines in white:


require(maps)
if(require(mapproj)) {    # mapproj is used for projection="polyconic"
  # color US county map by 2009 unemployment rate
  # match counties to map using FIPS county codes
  # Based on J's solution to the "Choropleth Challenge"
  # Code improvements by Hack-R (hack-r.github.io)

  # load data
  # unemp includes data for some counties not on the "lower 48 states" county
  # map, such as those in Alaska, Hawaii, Puerto Rico, and some tiny Virginia
  # cities
  data(unemp)
  data(county.fips)

  # define color buckets
  colors = c("paleturquoise", "skyblue", "cornflowerblue", "blueviolet", "hotpink", "darkgrey")
  unemp$colorBuckets <- as.numeric(cut(unemp$unemp, c(0, 2, 4, 6, 8, 10, 100)))
  leg.txt <- c("<2%", "2-4%", "4-6%", "6-8%", "8-10%", ">10%")

  # align data with map definitions by (partial) matching state, county
  # names, which include multiple polygons for some counties
  cnty.fips <- county.fips$fips[match(map("county", plot=FALSE)$names,
                                      county.fips$polyname)]
  colorsmatched <- unemp$colorBuckets[match(cnty.fips, unemp$fips)]

  # draw map
  par(mar=c(1, 1, 2, 1) + 0.1)
  map("county", col = colors[colorsmatched], fill = TRUE, resolution = 0,
      lty = 0, projection = "polyconic")
  map("state", col = "white", fill = FALSE, add = TRUE, lty = 1, lwd = 0.1,
      projection="polyconic")
  title("unemployment by county, 2009")
  legend("topright", leg.txt, horiz = TRUE, fill = colors, cex=0.6)
}


[Figure: US county choropleth titled "unemployment by county, 2009"]
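
The color-bucket step of the choropleth can be tried in isolation with base R only. In this minimal sketch, rates is a hypothetical vector of unemployment percentages; cut() assigns each value to an interval and the interval number indexes into the color vector:

```r
rates <- c(1.5, 3.2, 5.9, 7.4, 9.9, 12.0)   # hypothetical unemployment rates, in percent
colors <- c("paleturquoise", "skyblue", "cornflowerblue",
            "blueviolet", "hotpink", "darkgrey")

# intervals are (0,2], (2,4], (4,6], (6,8], (8,10], (10,100]
buckets <- as.numeric(cut(rates, c(0, 2, 4, 6, 8, 10, 100)))
buckets
# [1] 1 2 3 4 5 6

bucket_colors <- colors[buckets]   # one color per observation
bucket_colors[1]
# [1] "paleturquoise"
```

Values outside the break range (for example a rate of exactly 0) fall in no interval and yield NA, which map() would leave unfilled.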


Section 33.2: 50 State Maps and Advanced Choropleths with Google Viz


A common question is how to juxtapose (combine) physically separate geographical regions on the same map, such 
as in the case of a choropleth describing all 50 American states (The mainland with Alaska and Hawaii juxtaposed). 


Creating an attractive 50 state map is simple when leveraging Google Maps. Interfaces to Google's API include the
packages googleVis, ggmap, and RgoogleMaps.


require(googleVis)
G4 <- gvisGeoChart(CityPopularity, locationvar='City', colorvar='Popularity',
                   options=list(region='US', height=350,
                                displayMode='markers',
                                colorAxis="{values:[200,400,600,800],
                                colors:[\'red\', \'pink\', \'orange\',\'green\']}")
)
plot(G4)


[Figure: googleVis geo chart of CityPopularity with city markers and a color axis from 200 to 800]


The function gvisGeoChart() requires far less coding to create a choropleth compared to older mapping methods, 
such as map() from the package maps. The colorvar parameter allows easy coloring of a statistical variable, at a 
level specified by the locationvar parameter. The various options passed to options as a list allow customization 
of the map's details such as size (height), shape (markers), and color coding (colorAxis and colors). 


Section 33.3: Interactive plotly maps 


The plotly package allows many kinds of interactive plots, including maps. There are a few ways to create a map in
plotly. Either supply the map data yourself (via plot_ly() or ggplotly()), use plotly's "native" mapping 
capabilities (via plot_geo() or plot_mapbox()), or even a combination of both. An example of supplying the map 
yourself would be: 


library(plotly)

map_data("county") %>%
  group_by(group) %>%
  plot_ly(x = ~long, y = ~lat) %>%
  add_polygons() %>%
  layout(
    xaxis = list(title = "", showgrid = FALSE, showticklabels = FALSE),
    yaxis = list(title = "", showgrid = FALSE, showticklabels = FALSE)
  )


For a combination of both approaches, swap plot_ly() for plot_geo() or plot_mapbox() in the above example.
See the plotly book for more examples.


The next example is a "strictly native" approach that leverages the layout.geo attribute to set the aesthetics and 
zoom level of the map. It also uses the database world.cities from maps to filter the Brazilian cities and plot them 
on top of the "native" map. 


The main variables: poph is a text with the city and its population (which is shown upon mouse hover); q is an ordered
factor from the population's quantile; ge has information for the layout of the map. See the package
documentation for more information.


library(maps)
dfb <- world.cities[world.cities$country.etc=="Brazil", ]
library(plotly)
dfb$poph <- paste(dfb$name, "Pop", round(dfb$pop/1e6,2), " millions")
dfb$q <- with(dfb, cut(pop, quantile(pop), include.lowest = TRUE))
levels(dfb$q) <- paste(c("1st", "2nd", "3rd", "4th"), "Quantile")
dfb$q <- as.ordered(dfb$q)

ge <- list(
  scope = 'south america',
  showland = TRUE,
  landcolor = toRGB("gray85"),
  subunitwidth = 1,
  countrywidth = 1,
  subunitcolor = toRGB("white"),
  countrycolor = toRGB("white")
)

plot_geo(dfb, lon = ~long, lat = ~lat, text = ~poph,
         marker = ~list(size = sqrt(pop/10000) + 1, line = list(width = 0)),
         color = ~q, locationmode = 'country names') %>%
  layout(geo = ge, title = 'Populations<br>(Click legend to toggle)')


[Figure: interactive map of Brazilian cities titled "Populations (Click legend to toggle)", with legend entries for the 1st to 4th Quantile]


Section 33.4: Making Dynamic HTML Maps with Leaflet 


Leaflet is an open-source JavaScript library for making dynamic maps for the web. RStudio wrote R bindings for 
Leaflet, available through its leaflet package, built with htmlwidgets. Leaflet maps integrate well with the 
RMarkdown and Shiny ecosystems. 


The interface is piped, using a leaflet() function to initialize a map and subsequent functions adding (or 
removing) map layers. Many kinds of layers are available, from markers with popups to polygons for creating 
choropleth maps. Variables in the data.frame passed to leaflet() are accessed via formula-style ~ notation.


To map the state.name and state.center datasets: 


library(leaflet) 


data.frame(state.name, state.center) %>%
  leaflet() %>%
  addProviderTiles('Stamen.Watercolor') %>%
  addMarkers(lng = ~x, lat = ~y,
             popup = ~state.name,
             clusterOptions = markerClusterOptions())


[Figure: screenshot of the resulting Leaflet map. Map tiles by Stamen Design, CC BY 3.0; map data © OpenStreetMap]


Section 33.5: Dynamic Leaflet maps in Shiny applications 


The Leaflet package is designed to integrate with Shiny.

In the ui you call leafletOutput() and in the server you call renderLeaflet():


library(shiny)
library(leaflet)


ui <- fluidPage(
  leafletOutput("my_leaf")
)
server <- function(input, output, session) {
  output$my_leaf <- renderLeaflet({
    # argument values as in the leafletProxy example later in this section
    leaflet() %>%
      addProviderTiles('Hydda.Full') %>%
      setView(lat = -37.8, lng = 144.8, zoom = 8)
  })
}


shinyApp(ui, server) 


However, reactive inputs that affect the renderLeaflet expression will cause the entire map to be redrawn each
time the reactive element is updated.


Therefore, to modify a map that's already running you should use the leafletProxy() function.


Normally you use leaflet to create the static aspects of the map, and leafletProxy to manage the dynamic
elements, for example:


library(shiny) 
library(leaflet) 


ui <- fluidPage(
  sliderInput(inputId = "slider",
              label = "values",
              min = 0,
              max = 100,
              value = 0,
              step = 1),
  leafletOutput("my_leaf")
)


server <- function(input, output, session) {
  set.seed(123456)
  df <- data.frame(latitude = sample(seq(-38.5, -37.5, by = 0.01), 100),
                   longitude = sample(seq(144.0, 145.0, by = 0.01), 100),
                   value = seq(1, 100))

  ## create static element
  output$my_leaf <- renderLeaflet({
    leaflet() %>%
      addProviderTiles('Hydda.Full') %>%
      setView(lat = -37.8, lng = 144.8, zoom = 8)
  })

  ## filter data
  df_filtered <- reactive({
    df[df$value >= input$slider, ]
  })

  ## respond to the filtered data
  observe({
    leafletProxy(mapId = "my_leaf", data = df_filtered()) %>%
      clearMarkers() %>%   ## clear previous markers
      addMarkers()
  })
}


shinyApp(ui, server)


Chapter 34: Set operations


Section 34.1: Set operators for pairs of vectors


Comparing sets


In R, a vector may contain duplicated elements:


v = "A"
w = c("A", "A")


However, a set contains only one copy of each element. R treats a vector like a set by taking only its distinct 
elements, so the two vectors above are regarded as the same: 


setequal(v, w) 
# TRUE 


Combining sets


The key functions have natural names:


x = c(1, 2, 3)
y = c(2, 4)


union(x, y)
# 1 2 3 4


intersect(x, y)
# 2


setdiff(x, y)
# 1 3


These are all documented on the same page, ?union. 
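
These building blocks combine naturally. As a minimal sketch, the symmetric difference (elements in exactly one of the two sets) has no dedicated base function but follows from union and setdiff; the name sym_diff is introduced here for illustration:

```r
x <- c(1, 2, 3)
y <- c(2, 4)

# elements in exactly one of the two sets (symmetric difference)
sym_diff <- union(setdiff(x, y), setdiff(y, x))
sym_diff
# [1] 1 3 4

# duplicates in the inputs are dropped from the results
deduped <- union(c(1, 1, 2), c(2, 2, 3))
deduped
# [1] 1 2 3

# is.element(el, table) is the function form of el %in% table
has_two <- is.element(2, y)
has_two
# [1] TRUE
```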


Section 34.2: Cartesian or "cross" products of vectors 


To find every vector of the form (x, y) where x is drawn from vector X and y from Y, we use expand.grid:


X = c(1, 1, 2)
Y = c(4, 5)


expand.grid(X, Y)


#   Var1 Var2
# 1    1    4
# 2    1    4
# 3    2    4
# 4    1    5
# 5    1    5
# 6    2    5


The result is a data.frame with one column for each vector passed to it. Often, we want to take the Cartesian
product of sets rather than to expand a "grid" of vectors. We can use unique, lapply and do.call:


m = do.call(expand.grid, lapply(list(X, Y), unique))


#   Var1 Var2
# 1    1    4
# 2    2    4
# 3    1    5
# 4    2    5


Applying functions to combinations 


If you then want to apply a function to each resulting combination f(x, y), it can be added as another column: 


m$p = with(m, Var1*Var2)
#   Var1 Var2  p
# 1    1    4  4
# 2    2    4  8
# 3    1    5  5
# 4    2    5 10


This approach works for as many vectors as we need, but in the special case of two, it is sometimes a better fit to 
have the result in a matrix, which can be achieved with outer: 


uX = unique(X) 
uY = unique(Y) 


outer(setNames(uX, uX), setNames(uY, uY), `*`)


For related concepts and tools, see the combinatorics topic. 


Section 34.3: Set membership for vectors 


The %in% operator compares a vector with a set. 


w %in% v 
# TRUE TRUE 


v %in% w 
# TRUE 


Each element on the left is treated individually and tested for membership in the set associated with the vector on 
the right (consisting of all its distinct elements). 


Unlike equality tests, %in% always returns TRUE or FALSE: 


c(1, NA) %in% c(1, 2, 3, 4) 
# TRUE FALSE 


The documentation is at ?`%in%`.
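
The contrast with equality tests is easiest to see side by side. A minimal sketch; the variable names are introduced here for illustration:

```r
eq_test <- c(1, NA) == 1        # equality propagates NA
eq_test
# [1] TRUE   NA

in_test <- c(1, NA) %in% 1      # %in% never returns NA
in_test
# [1]  TRUE FALSE

na_found <- NA %in% c(1, NA)    # NA is matched like any other value
na_found
# [1] TRUE

# match() returns the position of the first match instead of TRUE/FALSE
pos <- match(c(3, 5), c(1, 2, 3, 4))
pos
# [1]  3 NA
```

This makes %in% a safer choice than == inside conditions such as if() or subsetting, where an NA would otherwise cause errors or NA rows.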


Section 34.4: Make unique / drop duplicates / select distinct elements from a vector


unique drops duplicates so that each element in the result is unique (only appears once):


x = c(2, 1, 1, 2, 1)


unique(x)
# 2 1


Values are returned in the order they first appeared.

duplicated tags each duplicated element:


duplicated(x) 
# FALSE FALSE TRUE TRUE TRUE 


anyDuplicated(x) > 0L is a quick way of checking whether a vector contains any duplicates.
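
A short sketch of how these functions relate to each other, using the same kind of small vector:

```r
x <- c(2, 1, 1, 2, 1)

# dropping the tagged duplicates reproduces unique(x)
kept <- x[!duplicated(x)]
kept
# [1] 2 1

# fromLast = TRUE tags everything except the LAST occurrence of each value
from_last <- duplicated(x, fromLast = TRUE)
from_last
# [1]  TRUE  TRUE  TRUE FALSE FALSE

# anyDuplicated() returns the index of the first duplicate, or 0 if none
first_dup <- anyDuplicated(x)
first_dup
# [1] 3
```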


Section 34.5: Measuring set overlaps / Venn diagrams for vectors


To count how many elements of two sets overlap, one could write a custom function: 


xtab_set <- function(A, B){
  both <- union(A, B)
  inA <- both %in% A
  inB <- both %in% B
  return(table(inA, inB))
}

A = 1:20
B = 10:30


xtab_set(A, B)


#        inB
# inA     FALSE TRUE
#   FALSE     0   10
#   TRUE      9   11


A Venn diagram, offered by various packages, can be used to visualize overlap counts across multiple sets.


Chapter 35: tidyverse


Section 35.1: tidyverse: an overview


What is tidyverse?


tidyverse is a fast and elegant way to turn basic R into an enhanced tool, redesigned by Hadley/RStudio. The
development of all packages included in tidyverse follows the principles of The tidy tools manifesto. But first,
let the authors describe their masterpiece:


The tidyverse is a set of packages that work in harmony because they share common data 
representations and API design. The tidyverse package is designed to make it easy to install and load core 
packages from the tidyverse in a single command. 


The best place to learn about all the packages in the tidyverse and how they fit together is R for Data 
Science. Expect to hear more about the tidyverse in the coming months as I work on improved package
websites, making citation easier, and providing a common home for discussions about data analysis with 
the tidyverse. 


(source)


How to use it?


Just as with ordinary R packages, you need to install and load the package.


install.packages("tidyverse")
library("tidyverse")


The difference is that, with a single command, a couple of dozen packages are installed/loaded. As a bonus, one may
rest assured that all the installed/loaded packages are of compatible versions.


What are those packages? 
The commonly known and widely used packages: 


• ggplot2: advanced data visualisation
• dplyr: fast (Rcpp) and coherent approach to data manipulation
• tidyr: tools for data tidying
• readr: for data import.
• purrr: makes your pure functions purr by completing R's functional programming tools with important
features from other languages, in the style of the JS packages underscore.js, lodash and lazy.js.
• tibble: a modern re-imagining of data frames.
• magrittr: piping to make code more readable


Packages for manipulating specific data formats: 


• hms: easily read times
• stringr: provides a cohesive set of functions designed to make working with strings as easy as possible
• lubridate: advanced date/time manipulations
• forcats: advanced work with factors.


Data import: 
• DBI: defines a common interface between R and database management systems (DBMS)
• haven: easily import SPSS, SAS and Stata files
• httr: the aim of httr is to provide a wrapper for the curl package, customised to the demands of modern web
APIs
• jsonlite: a fast JSON parser and generator optimized for statistical data and the web
• readxl: read .xls and .xlsx files without need for dependency packages
• rvest: rvest helps you scrape information from web pages
• xml2: for XML


And modelling: 


• modelr: provides functions that help you create elegant pipelines when modelling
• broom: easily extract the models into tidy data


Finally, tidyverse suggests the use of:


• knitr: the amazing general-purpose literate programming engine, with lightweight APIs designed to give
users full control of the output without heavy coding work.
• rmarkdown: RStudio's package for reproducible programming.


Section 35.2: Creating tbl_df’s 


A tbl_df (pronounced tibble diff) is a variation of a data frame that is often used in tidyverse packages. It is 
implemented in the tibble package. 


Use the as_tibble function (called as_data_frame in older versions of tibble, where that name is now deprecated) to turn a data frame into a tbl_df:


library(tibble)
mtcars_tbl <- as_tibble(mtcars)


One of the most notable differences between data.frames and tbl_dfs is how they print: 


# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
*  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows


• The printed output includes a summary of the dimensions of the table (32 x 11)
• It includes the type of each column (dbl)
• It prints a limited number of rows. (To change this use options(tibble.print_max = [number])).


Many functions in the dplyr package work naturally with tbl_dfs, such as group_by(). 
Chapter 36: Rcpp
Section 36.1: Extending Rcpp with Plugins 


Within C++, one can set different compilation flags using:


// [[Rcpp::plugins(name)]]


List of the built-in plugins:


// built-in C++11 plugin
// [[Rcpp::plugins(cpp11)]]


// built-in C++11 plugin for older g++ compiler
// [[Rcpp::plugins(cpp0x)]]


// built-in C++14 plugin for C++14 standard
// [[Rcpp::plugins(cpp14)]]


// built-in C++1y plugin for C++14 and C++17 standard under development
// [[Rcpp::plugins(cpp1y)]]


// built-in OpenMP plugin
// [[Rcpp::plugins(openmp)]]


Section 36.2: Inline Code Compile 


Rcpp features two functions that enable code compilation inline and exportation directly into R: cppFunction() and 
evalCpp(). A third function called sourceCpp() exists to read in C++ code in a separate file though can be used akin 
to cppFunction(). 


Below is an example of compiling a C++ function within R. Note the use of "" to surround the source. 


# Note - This is R code. 
# cppFunction in Rcpp allows for rapid testing. 
require(Rcpp) 


# Creates a function that multiplies each element in a vector by i
# and returns the modified vector.

cppFunction("
NumericVector exfun(NumericVector x, int i) {
    x = x * i;
    return x;
}")

# Calling function in R
exfun(1:5, 3)


To quickly understand a C++ expression use: 


# Use evalCpp to evaluate C++ expressions 
evalCpp("std::numeric_limits<double>::max()")
## [1] 1.797693e+308 
Section 36.3: Rcpp Attributes


Rcpp Attributes make the process of working with R and C++ straightforward. Attributes take the form:


// [[Rcpp::attribute]]


The use of attributes is typically associated with:


// [[Rcpp::export]]


that is placed directly above a declared function header when reading in a C++ file via sourceCpp(). 


Below is an example of an external C++ file that uses attributes. 


// Add code below into C++ file Rcpp_example.cpp 


#include <Rcpp.h> 
using namespace Rcpp; 


// Place the export tag right above function declaration. 


// [[Rcpp::export]]
double muRcpp(NumericVector x) {

    int n = x.size(); // Size of vector
    double sum = 0;   // Sum value

    // For loop, note cpp index shift to 0
    for (int i = 0; i < n; i++) {
        // Shorthand for sum = sum + x[i]
        sum += x[i];
    }

    return sum/n; // Obtain and return the Mean
}


// Place dependent functions above call or 
// declare the function definition with: 
double muRcpp(NumericVector x); 


// [[Rcpp::export]]
double varRcpp(NumericVector x, bool bias = true) {

    // Calculate the mean using C++ function
    double mean = muRcpp(x);
    double sum = 0;
    int n = x.size();

    for (int i = 0; i < n; i++) {
        sum += pow(x[i] - mean, 2.0); // Square
    }

    return sum/(n-bias); // Return variance
}

To use this external C++ file within R, we do the following: 


require(Rcpp) 
# Compile File
sourceCpp("path/to/file/Rcpp_example.cpp")


# Make some sample data
x = 1:5


all.equal(muRcpp(x), mean(x))
## TRUE 


all.equal(varRcpp(x), var(x)) 
## TRUE 


Section 36.4: Specifying Additional Build Dependencies 


To use additional packages within the Rcpp ecosystem, the correct header file may not be Rcpp.h but
Rcpp<PACKAGE>.h (as e.g. for RcppArmadillo). It typically needs to be imported and then the dependency is stated
within


// [[Rcpp::depends(Rcpp<PACKAGE>)]]


Examples: 


// Use the RcppArmadillo package
// Requires different header file from Rcpp.h
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]


// Use the RcppEigen package
// Requires different header file from Rcpp.h
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]




Chapter 37: Random Numbers Generator 


Section 37.1: Random permutations 
To generate a random permutation of 5 numbers:


sample(5)
# output: the integers 1 to 5 in random order


To generate a random permutation of any vector:


sample(10:15)
# output: the integers 10 to 15 in random order


One could also use the package pracma


randperm(a, k)

# Generates one random permutation of k of the elements a, if a is a vector,
# or of 1:a if a is a single integer.

# a: integer or numeric vector of some length n.

# k: integer, smaller than a or length(a).


# Examples
library(pracma)
randperm(1:10, 3)
# output: three of the values 1 to 10 in random order


randperm(10, 10)
# output: the integers 1 to 10 in random order


randperm(seq(2, 10, by=2))
[1] 6 4 10 2 8
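

As a small illustration of the calls above, the following base-R sketch checks that sample() returns a genuine permutation and shows a well-known pitfall: a single number n >= 1 is treated as 1:n, not as a one-element vector. The variable names p, q and x are illustrative only.

```r
set.seed(1)                      # fix the seed so the run is repeatable

p <- sample(10:15)               # random ordering of the six values 10..15
identical(sort(p), 10:15)        # TRUE: same values, only the order differs

# Pitfall: a single number n is interpreted as 1:n
q <- sample(15)                  # a permutation of 1:15, not the value 15
length(q)                        # 15

# Permuting positions instead of values avoids the pitfall
x <- 15
x[sample(length(x))]             # always 15
```

Permuting positions with x[sample(length(x))] is one common way to stay safe when the length of x is not known in advance.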


Section 37.2: Generating random numbers using various 
density functions 


Below are examples of generating 5 random numbers using various probability distributions. 


Uniform distribution between 0 and 10 


runif(5, min=0, max=10)
[1] 2.1724399 8.92809938 6.1969249 9.3303321 2.4054102


Normal distribution with 0 mean and standard deviation of 1


rnorm(5, mean=0, sd=1)
[1] -0.97414402 -0.85722281 -0.08555494 -0.37444299 1.20032409


Binomial distribution with 10 trials and success probability of 0.5


rbinom(5, size=10, prob=0.5)
# output: five random counts between 0 and 10


Geometric distribution with 0.2 success probability 


rgeom(5, prob=0.2) 
# output: five random non-negative integers


Hypergeometric distribution with 3 white balls, 10 black balls and 5 draws


rhyper(5, m=3, n=10, k=5)
# output: five random counts between 0 and 3


Negative Binomial distribution with a target of 10 successes and success probability of 0.8


rnbinom(5, size=10, prob=0.8)
# output: five random non-negative integers


Poisson distribution with mean and variance (lambda) of 2 


rpois(5, lambda=2) 
# output: five random non-negative integers


Exponential distribution with the rate of 1.5


rexp(5, rate=1.5)
[1] 1.8993303 0.4799358 0.5578280 1.5630711 0.6228000


Logistic distribution with 0 location and scale of 1


rlogis(5, location=0, scale=1)
[1] 0.9498992 -1.0287433 -0.4192311 0.7028510 -1.2095458


Chi-squared distribution with 15 degrees of freedom 


rchisq(5, df=15) 
[1] 14.89209 19.36947 10.27745 19.48376 23.32898 


Beta distribution with shape parameters a=1 and b=0.5 


rbeta(5, shape1=1, shape2=0.5)
[1] 0.1670306 0.5321586 0.9869520 0.9548993 0.9999737


Gamma distribution with shape parameter of 3 and scale=0.5


rgamma(5, shape=3, scale=0.5)
[1] 2.2445984 0.7934152 3.2366673 2.2897537 0.8573059


Cauchy distribution with 0 location and scale of 1


rcauchy(5, location=0, scale=1)
[1] -0.01285116 -0.38918446 8.71016696 10.60293284 -0.68017185


Log-normal distribution with 0 mean and standard deviation of 1 (on log scale)


rlnorm(5, meanlog=0, sdlog=1)
[1] 0.8725009 2.9433779 0.3329107 2.5976206 2.8171894


Weibull distribution with shape parameter of 0.5 and scale of 1


rweibull(5, shape=0.5, scale=1)
[1] 0.337599112 1.307774557 7.233985075 5.840429942 0.0905751181


Wilcoxon distribution with 10 observations in the first sample and 20 in second. 


rwilcox(5, 10, 20) 
[1] 111 88 93 108 124 


Multinomial distribution with 5 objects and 3 boxes using the specified probabilities


rmultinom(5, size=5, prob=c(0.1,0.1,0.8))
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    1    1    0
[2,]    2    0    1    1    0
[3,]    3    5    3    3    5
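

Each r* generator listed in this section has d*, p* and q* counterparts in base R (density, cumulative probability and quantile). The sketch below uses the normal distribution as one example; the sample size of 1000 and the cut-off 1.96 are arbitrary choices made for illustration.

```r
set.seed(42)
draws <- rnorm(1000, mean = 0, sd = 1)   # random draws, as in the examples above

dnorm(0)               # density at 0, i.e. 1 / sqrt(2 * pi)
pnorm(1.96)            # P(X <= 1.96), roughly 0.975
qnorm(0.975)           # inverse of pnorm, roughly 1.96

# Fraction of draws at or below 1.96; should be close to pnorm(1.96)
mean(draws <= 1.96)
```

Comparing an empirical fraction with the matching p* value is a quick sanity check that the parameters passed to the generator are the intended ones.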


Section 37.3: Random number generator’s reproducibility 


When expecting someone to reproduce R code that has random elements in it, the set.seed() function
becomes very handy. For example, these two lines will almost always produce different output (because that is the whole
point of random number generators):


> sample(1:10,5)
# output: five distinct values from 1:10
> sample(1:10,5)
# output: five distinct values from 1:10, generally not the same as above


These two will also produce different outputs: 


> rnorm(5) 

[1] 0.4874291 0.7383247 0.5757814 -0.3053884 1.5117812

> rnorm(5)

[1] 0.38984324 -0.62124058 -2.21469989 1.12493092 -0.04493361


However, if we set the seed to something identical in both cases (most people use 1 for simplicity), we get two 
identical samples: 


> set.seed(1) 

> sample(letters, 2) 
[1] "g" "j"

> set.seed(1)

> sample(letters, 2)
[1] "g" "j"


and same with, say, rexp() draws: 


> set.seed(1) 

> rexp(5) 

[1] 0.7551818 1.1816428 0.1457067 0.1397953 0.4360686
> set.seed(1)

> rexp(5)

[1] 0.7551818 1.1816428 0.1457067 0.1397953 0.4360686
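

Rather than comparing printed output by eye, the effect of set.seed() can be checked programmatically. This is a minimal sketch; the seed value 2024 and the variable names are arbitrary.

```r
set.seed(2024)
first <- rnorm(3)

set.seed(2024)
second <- rnorm(3)

identical(first, second)   # TRUE: same seed, same draws

third <- rnorm(3)          # no reseeding: the stream simply continues
identical(first, third)    # FALSE
```

Note that a seed pins down results only for a given generator configuration. For example, R 3.6.0 changed the default algorithm used by sample(), so sample() output for the same seed can differ between older and newer R versions.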
Chapter 38: Parallel processing


Section 38.1: Parallel processing with parallel package 


The base package parallel allows parallel computation through forking, sockets, and random-number generation. 


Detect the number of cores present on the localhost: 


parallel::detectCores(all.tests = FALSE, logical = TRUE)


Create a cluster of the cores on the localhost:


parallelCluster <- parallel::makeCluster(parallel::detectCores())


First, a function appropriate for parallelization must be created. Consider the mtcars dataset. A regression on mpg 
could be improved by creating a separate regression model for each level of cyl. 


data <- mtcars 

yfactor <- 'cyl'

zlevels <- sort(unique(data[[yfactor]])) 
datay <- data[,1] 

dataz <- data[,2] 

datax <- data[,3:11] 


fitmodel <- function(zlevel, datax, datay, dataz) { 
glm.fit(x = datax[dataz == zlevel,], y = datay[dataz == zlevel]) 
} 


Create a function that can loop through all the possible iterations of zlevels. This is still in serial, but is an 
important step as it determines the exact process that will be parallelized. 


fitmodel <- function(zlevel, datax, datay, dataz) { 
glm.fit(x = datax[dataz == zlevel,], y = datay[dataz == zlevel]) 
} 


for (zlevel in zlevels) {
  print("*****")
  print(zlevel)
  print(fitmodel(zlevel, datax, datay, dataz))
}


Curry this function: 


worker <- function(zlevel) { 
fitmodel(zlevel,datax, datay, dataz) 
} 


Parallel computing using parallel cannot access the global environment. Luckily, each function creates a local 
environment parallel can access. Creation of a wrapper function allows for parallelization. The function to be 
applied also needs to be placed within the environment. 


wrapper <- function(datax, datay, dataz) {
  # force evaluation of all parameters not supplied by parallelization apply
  force(datax)
  force(datay)
  force(dataz)
  # these variables are now in an environment accessible by parallel function

  # function to be applied also in the environment
  fitmodel <- function(zlevel, datax, datay, dataz) {
    glm.fit(x = datax[dataz == zlevel,], y = datay[dataz == zlevel])
  }

  # calling in this environment iterating over single parameter zlevel
  worker <- function(zlevel) {
    fitmodel(zlevel,datax, datay, dataz)
  }

  return(worker)
}


Now create a cluster and run the wrapper function. 


parallelcluster <- parallel::makeCluster(parallel::detectCores())
models <- parallel::parLapply(parallelcluster,zlevels,
                              wrapper(datax, datay, dataz))


Always stop the cluster when finished.


parallel::stopCluster(parallelcluster)


The parallel package includes the entire apply() family, prefixed with par. 
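

The wrapper approach above carries the data along inside the worker function's environment. A commonly used alternative is to copy the needed objects to the workers explicitly with clusterExport(). The sketch below assumes two local worker processes can be started; the variable offset is a made-up example object.

```r
library(parallel)

cl <- makeCluster(2)                 # two local worker processes (assumed available)

offset <- 10                         # exists only in the calling session so far
clusterExport(cl, "offset", envir = environment())   # copy it to every worker

# Each worker can now resolve 'offset' when evaluating the function
res <- parLapply(cl, 1:4, function(i) i^2 + offset)

stopCluster(cl)                      # always release the workers

unlist(res)                          # 11 14 19 26
```

Exporting by name keeps the worker function short, at the cost of having to remember every object it depends on.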


Section 38.2: Parallel processing with foreach package 


The foreach package brings the power of parallel processing to R. But before you can use multi-core CPUs, you
have to register a multi-core cluster. The doSNOW package is one possibility.


A simple use of the foreach loop is to calculate the sum of the square root and the square of all numbers from 1 to 
100000. 


library(foreach)
library(doSNOW)


cl <- makeCluster(5, type = "SOCK")
registerDoSNOW(cl)


f <- foreach(i = 1:100000, .combine = c, .inorder = FALSE) %dopar% {
  k <- i ** 2 + sqrt(i)
  k
}


The structure of the output of foreach is controlled by the .combine argument. The default output structure is a 
list. In the code above, c is used to return a vector instead. Note that a calculation function (or operator) such as "+" 
may also be used to perform a calculation and return a further processed object. 


It is important to mention that the result of each iteration of the foreach loop is the value of its last expression. Thus, in this
example k will be added to the result.


Parameter   Details
.combine    Function. Determines how the results of the loop are combined. Possible values are c, cbind,
            rbind, "+", "*"...
.inorder    If TRUE the result is ordered according to the order of the iteration variable (here i). If FALSE the result
            is not ordered. This can have positive effects on computation time.
.packages   For functions which are provided by any package except base, like e.g. MASS, randomForest or else, you
            have to provide these packages with c("MASS", "randomForest")
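

To get a feel for what the .combine argument does, here is a base-R analogy using lapply() and Reduce(). It is only an analogy, not how foreach is implemented, and it runs serially; the names results, as_vector, as_sum and as_matrix are illustrative.

```r
# One result per "iteration", collected in a list (the default foreach output shape)
results <- lapply(1:5, function(i) i^2 + sqrt(i))

as_vector <- Reduce(c, results)      # comparable to .combine = c
as_sum    <- Reduce(`+`, results)    # comparable to .combine = "+"

# Row-binding per-iteration vectors, comparable to .combine = rbind
as_matrix <- Reduce(rbind, lapply(1:3, function(i) c(i, i^2)))
dim(as_matrix)                       # 3 2
```

Combining with "+" returns a single number, which avoids holding all intermediate results when only the total is needed.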


Section 38.3: Random Number Generation 


A major problem with parallelization is the use of RNG seeds. Random numbers are iterated by
the number of operations from either the start of the session or the most recent set.seed(). Since parallel
processes arise from the same function, they can use the same seed, possibly causing identical results! The calls then
effectively repeat the same work on the different cores, providing no advantage.


A set of seeds must be generated and sent to each parallel process. This is automatically done in some packages 
(parallel, snow, etc.), but must be explicitly addressed in others. 


s <- seed
for (i in 1:numofcores) {
  s <- nextRNGStream(s)
  # send s to worker i as .Random.seed
}


Seeds can also be set for reproducibility.


clusterSetRNGStream(cl = parallelcluster, iseed) 
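

The seed-stream idea sketched in pseudocode above can be made concrete with the "L'Ecuyer-CMRG" generator, which nextRNGStream() in the parallel package requires. This is a minimal sketch that only builds the per-worker seeds and does not start any workers; n_workers and seeds are illustrative names.

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")     # nextRNGStream() needs this generator
set.seed(123)

n_workers <- 3               # assumed number of parallel processes
seeds <- vector("list", n_workers)

s <- .Random.seed            # starting stream
for (i in seq_len(n_workers)) {
  seeds[[i]] <- s            # stream that worker i would receive
  s <- nextRNGStream(s)      # jump ahead to an independent stream
}

length(seeds)                # 3 distinct, reproducible seed vectors
```

In practice clusterSetRNGStream() performs this distribution for you; doing it by hand is mainly useful with back ends that do not.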


Section 38.4: mcparallelDo 


The mcparallelDo package allows for the evaluation of R code asynchronously on Unix-alike (e.g. Linux and 
MacOSX) operating systems. The underlying philosophy of the package is aligned with the needs of exploratory 
data analysis rather than coding. For coding asynchrony, consider the future package. 


Example


Create data


data(ToothGrowth)


Trigger mcparallelDo to perform analysis on a fork


mcparallelDo({glm(len ~ supp * dose, data=ToothGrowth)}, "interactionPredictorModel")


Do other things, e.g. 


binaryPredictorModel <- glm(len ~ supp, data=ToothGrowth) 
gaussianPredictorModel <- glm(len ~ dose, data=ToothGrowth) 


The result from mcparallelDo returns in your targetEnvironment, e.g. .GlobalEnv, when it is complete with a 
message (by default) 


summary(interactionPredictorModel)


Other Examples 


# Example of not returning a value until we return to the top level
for (i in 1:10) {
  if (i == 1) {
    mcparallelDo({2+2}, targetValue = "output")
  }
  if (exists("output")) print(i)
}


# Example of getting a value without returning to the top level
for (i in 1:10) {
  if (i == 1) {
    mcparallelDo({2+2}, targetValue = "output")
  }
  mcparallelDoCheck()
  if (exists("output")) print(i)
}
Chapter 39: Subsetting


Given an R object, we may require separate analysis for one or more parts of the data contained in it. The process
of obtaining these parts of the data from a given object is called subsetting.


Section 39.1: Data frames


Subsetting a data frame into a smaller data frame can be accomplished the same as subsetting a list.


> df3 <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE) 


> df3

##   x y
## 1 1 a
## 2 2 b
## 3 3 c


> df3[1] # Subset a variable by number

##   x
## 1 1
## 2 2
## 3 3


> df3["x"] # Subset a variable by name

##   x
## 1 1
## 2 2
## 3 3


> is.data.frame(df3[1])
## TRUE


> is.list(df3[1])
## TRUE 


Subsetting a data frame into a column vector can be accomplished using double brackets [[ ]] or the dollar sign
operator $.


> df3[[2]] # Subset a variable by number using [[ ]]
## [1] "a" "b" "c"


> df3[["y"]] # Subset a variable by name using [[ ]]
## [1] "a" "b" "c"


> df3$x # Subset a variable by name using $
## [1] 1 2 3


> typeof(df3$x)
## "integer" 


> is.vector(df3$x) 
## TRUE 


Subsetting a data frame as a two-dimensional matrix can be accomplished using i and j terms.


> df3[1, 2] # Subset row and column by number
## [1] "a"


> df3[1, "y"] # Subset row by number and column by name
## [1] "a"


> df3[2, ] # Subset entire row by number
##   x y
## 2 2 b


> df3[ , 1] # Subset the entire first variable
## [1] 1 2 3


> df3[ , 1, drop = FALSE]

##   x
## 1 1
## 2 2
## 3 3


Note: Subsetting by j (column) alone simplifies to the variable's own type, but subsetting by i alone returns a 
data.frame, as the different variables may have different types and classes. Setting the drop parameter to FALSE 
keeps the data frame. 


> is.vector(df3[, 2]) 
## TRUE 


> is.data.frame(df3[2, ]) 
## TRUE 


> is.data.frame(df3[, 2, drop = FALSE]) 
## TRUE 
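

The i and j terms can also be logical vectors, which is how rows are usually filtered by a condition. The following sketch reuses df3 from above; the condition x > 1 and the result names are chosen only for illustration.

```r
df3 <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

rows    <- df3[df3$x > 1, ]                    # logical row index: still a data frame
col_vec <- df3[df3$x > 1, "y"]                 # single column: simplified to a vector
col_df  <- df3[df3$x > 1, "y", drop = FALSE]   # drop = FALSE keeps the data frame

nrow(rows)        # 2
col_vec           # "b" "c"
```

The same drop rule described in the note above applies here: selecting a single column simplifies unless drop = FALSE is given.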


Section 39.2: Atomic vectors 


Atomic vectors (which excludes lists and expressions, which are also vectors) are subset using the [ operator: 


# create an example vector 
v1 <- c("a", "b", "c", "d")


# select the third element
v1[3]
## [1] "c"


The [ operator can also take a vector as the argument. For example, to select the first and third elements:


v1 <- c("a", "b", "c", "d")


v1[c(1, 3)]
## [1] "a" "c"


Sometimes we may need to omit a particular value from the vector. This can be achieved using a negative sign (-)
before the index of that value. For example, to omit the first value from v1, use v1[-1]. This can be
extended to more than one value in a straightforward way. For example, v1[-c(1,3)].

> v1[-1]

[1] "b" "c" "d"

> v1[-c(1,3)]

[1] "b" "d"


On some occasions, especially when the vector is long, we would like to know the index of a particular
value, if it exists:


> v1=="c"

[1] FALSE FALSE TRUE FALSE
> which(v1=="c")

[1] 3


If the atomic vector has names (a names attribute), it can be subset using a character vector of names: 


v <- 1:3
names(v) <- c("one", "two", "three")


v
##   one   two three
##     1     2     3


v["two"] 
## two 
## 2 


The [[ operator can also be used to index atomic vectors, with the differences that it accepts only an indexing vector with a
length of one and strips any names present:


v[[c(1, 2)]]
## Error in v[[c(1, 2)]] :
##   attempt to select more than one element in vectorIndex


v[["two"]]
## [1] 2


Vectors can also be subset using a logical vector. In contrast to subsetting with numeric and character vectors, the
logical vector used to subset is expected to have the same length as the vector whose elements are extracted, so if a logical
vector y is used to subset x, i.e. x[y], and length(y) < length(x), then y will be recycled to match length(x):


v[c(TRUE, FALSE, TRUE)]
##   one three
##     1     3


v[c(FALSE, TRUE)] # recycled to 'c(FALSE, TRUE, FALSE)'
## two
##   2


v[TRUE] # recycled to 'c(TRUE, TRUE, TRUE)'
##   one   two three
##     1     2     3


v[FALSE] # handy to discard elements but save the vector's type and basic structure
## named integer(0)
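

In practice the logical index is rarely typed by hand; it usually comes from a comparison on the vector itself. The sketch below combines the indexing styles from this section on the same named vector v; the result names are illustrative.

```r
v <- c(one = 1L, two = 2L, three = 3L)

big     <- v[v > 1]                # logical index built from a comparison
dropped <- v[-which(v == 2)]       # negative positions found with which()
ordered <- v[c("three", "one")]    # names also control the order of the result
single  <- v[["two"]]              # [[ ]] returns one element without its name

names(big)       # "two" "three"
names(dropped)   # "one" "three"
```

One caveat with the v[-which(...)] form: if nothing matches, which() returns an empty index and the result is an empty vector, so v[!(v == 2)] is often the safer spelling.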


Section 39.3: Matrices 


For each dimension of an object, the [ operator takes one argument. Vectors have one dimension and take one 
argument. Matrices and data frames have two dimensions and take two arguments, given as [i, j] where i is the
row and j is the column. Indexing starts at 1. 


## a sample matrix
mat <- matrix(1:6, nrow = 2, dimnames = list(c("row1", "row2"), c("col1", "col2", "col3")))
mat

#      col1 col2 col3
# row1    1    3    5
# row2    2    4    6


mat[i, j] is the element in the i-th row, j-th column of the matrix mat. For example, an i value of 2 and a j value of 
1 gives the number in the second row and the first column of the matrix. Omitting i or j returns all values in that 
dimension. 


mat[ , 3]
## row1 row2
##    5    6
mat[1, ]


# col1 col2 col3
#    1    3    5


When the matrix has row or column names (not required), these can be used for subsetting: 


mat[ , 'col1']
# row1 row2
#    1    2


By default, the result of a subset will be simplified if possible. If the subset only has one dimension, as in the 
examples above, the result will be a one-dimensional vector rather than a two-dimensional matrix. This default can 
be overridden with the drop = FALSE argument to [: 


## This selects the first row as a vector 
class(mat[1, ]) 
# [1] "integer" 


## Whereas this selects the first row as a 1x3 matrix: 
class(mat[1, , drop = F]) 
# [1] "matrix" "array"   (R >= 4.0.0; earlier versions print only "matrix")


Of course, dimensions cannot be dropped if the selection itself has two dimensions: 


mat[1:2, 2:3] ## A 2x2 matrix
#      col2 col3
# row1    3    5
# row2    4    6


Selecting individual matrix entries by their positions 


It is also possible to use an Nx2 matrix to select N individual elements from a matrix (like how a coordinate system
works). If you wanted to extract, in a vector, the entries of a matrix in the (1st row, 1st column), (1st row, 3rd
column), (2nd row, 3rd column), (2nd row, 1st column), this can be done easily by creating an index matrix
with those coordinates and using that to subset the matrix:


mat

#      col1 col2 col3
# row1    1    3    5
# row2    2    4    6


ind = rbind(c(1, 1), c(1, 3), c(2, 3), c(2, 1))
ind


#      [,1] [,2]
# [1,]    1    1
# [2,]    1    3
# [3,]    2    3
# [4,]    2    1


mat[ind]
# [1] 1 5 6 2


In the above example, the 1st column of the ind matrix refers to rows in mat, the 2nd column of ind refers to 
columns in mat. 
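

An index matrix of this shape does not have to be written by hand: which() with arr.ind = TRUE returns one (row, column) pair per matching cell. The sketch below rebuilds mat so that it is self-contained; the threshold 3 is arbitrary.

```r
mat <- matrix(1:6, nrow = 2,
              dimnames = list(c("row1", "row2"), c("col1", "col2", "col3")))

ind <- which(mat > 3, arr.ind = TRUE)   # Nx2 matrix of (row, col) coordinates
ind

vals <- mat[ind]                        # 4 5 6, in column-major order
mat[ind] <- 0L                          # index matrices also work for assignment
mat
```

Because the same index matrix works on both sides of an assignment, it is a convenient way to overwrite scattered cells in one step.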


Section 39.4: Lists 


A list can be subset with [: 


l1 <- list(c(1, 2, 3), 'two' = c("a", "b", "c"), list(10, 20))
l1

## [[1]]
## [1] 1 2 3
##
## $two
## [1] "a" "b" "c"
##
## [[3]]
## [[3]][[1]]
## [1] 10
##
## [[3]][[2]]
## [1] 20

l1[1]
## [[1]]
## [1] 1 2 3


l1['two']
## $two
## [1] "a" "b" "c"


l1[[2]]
## [1] "a" "b" "c"


l1[['two']]
## [1] "a" "b" "c"


Note the result of l1[2] is still a list, as the [ operator selects elements of a list, returning a smaller list. The [[
operator extracts list elements, returning an object of the type of the list element.


Elements can be indexed by number or a character string of the name (if it exists). Multiple elements can be 
selected with [ by passing a vector of numbers or strings of names. Indexing with a vector of length > 1 in [ and 
[[ returns a "list" with the specified elements and a recursive subset (if available), respectively: 


l1[c(3, 1)]
## [[1]]
## [[1]][[1]]
## [1] 10
##
## [[1]][[2]]
## [1] 20
##
##
## [[2]]
## [1] 1 2 3


Compared to:


l1[[c(3, 1)]]
## [1] 10


which is equivalent to:


l1[[3]][[1]]
## [1] 10


The $ operator allows you to select list elements solely by name, but unlike [ and [[, does not require quotes. As an
infix operator, $ can only take a single name:


l1$two
## [1] "a" "b" "c"


Also, the $ operator allows for partial matching by default:


l1$t
## [1] "a" "b" "c"


in contrast with [[ where it needs to be specified whether partial matching is allowed:


l1[["t"]]
## NULL

l1[["t", exact = FALSE]]
## [1] "a" "b" "c"


Setting options(warnPartialMatchDollar = TRUE), a "warning" is given when partial matching happens with $:


l1$t

## [1] "a" "b" "c"

## Warning message:

## In l1$t : partial match of 't' to 'two'
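

The difference between [ and [[ on lists is easiest to see by checking the class of what comes back. This short sketch rebuilds the list l1 from above so that it runs on its own.

```r
l1 <- list(c(1, 2, 3), two = c("a", "b", "c"), list(10, 20))

class(l1[2])          # "list": a list of length one
class(l1[[2]])        # "character": the element itself
length(l1[c(1, 3)])   # 2: several elements can only be taken with [

# [[ with exact = FALSE mimics the partial matching of $
identical(l1[["t", exact = FALSE]], l1$two)   # TRUE
```

A useful rule of thumb that follows: use [ when the result should stay a list, and [[ or $ when the contents of one element are needed.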


Section 39.5: Vector indexing 
For this example, we will use the vector: 


> x <- 11:20
> x
[1] 11 12 13 14 15 16 17 18 19 20


R vectors are 1-indexed, so for example x[1] will return 11. We can also extract a sub-vector of x by passing a 
vector of indices to the bracket operator: 


> x[c(2,4,6)] 
[1] 12 14 16


If we pass a vector of negative indices, R will return a sub-vector with the specified indices excluded:


> x[c(-1,-3)]
[1] 12 14 15 16 17 18 19 20


We can also pass a boolean vector to the bracket operator, in which case it returns a sub-vector corresponding to 
the coordinates where the indexing vector is TRUE: 


> x[c(rep(TRUE,5),rep(FALSE,5))]
[1] 11 12 13 14 15


If the indexing vector is shorter than the length of the vector, then it will be repeated, as in:


> x[c(TRUE,FALSE)]

[1] 11 13 15 17 19

> x[c(TRUE,FALSE,FALSE)]
[1] 11 14 17 20


Section 39.6: Other objects 


The [ and [[ operators are primitive functions that are generic. This means that any object in R (specifically
isTRUE(is.object(x)), i.e. one that has an explicit "class" attribute) can have its own specified behaviour when subsetted;
i.e. it has its own methods for [ and/or [[.


For example, this is the case with "data.frame" (is.object(iris)) objects where [.data.frame and [[.data.frame
methods are defined and they are made to exhibit both "matrix"-like and "list"-like subsetting. By forcing an error
when subsetting a "data.frame", we see that, actually, a function [.data.frame was called when we just used [.


iris[invalidArgument, ]
## Error in `[.data.frame`(iris, invalidArgument, ) :
##   object 'invalidArgument' not found


Without further details on the current topic, an example [ method:


x = structure(1:5, class = "myClass")
x[c(3, 2, 4)]
## [1] 3 2 4

'[.myClass' = function(x, i) cat(sprintf("We'd expect '%s[%s]' to be returned but this a custom `[` method and should have a `?[.myClass` help page for its behaviour\n", deparse(substitute(x)), deparse(substitute(i))))


x[c(3, 2, 4)]
## We'd expect 'x[c(3, 2, 4)]' to be returned but this a custom `[` method and should have a `?[.myClass` help page for its behaviour
## NULL


We can overcome the method dispatching of [ by using the equivalent non-generic .subset (and .subset2 for [[).
This is especially useful and efficient when programming our own "class"es and want to avoid work-arounds (like
unclass(x)) when computing on our "class"es efficiently (avoiding method dispatch and copying objects):


.subset(x, c(3, 2, 4))
## [1] 3 2 4
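

A more typical use of a custom [ method is to keep the class attribute on the result, since default subsetting drops it. The method below is a hypothetical sketch for the "myClass" example class; it delegates the actual work to the default behaviour via NextMethod().

```r
x <- structure(1:5, class = "myClass")

# Hypothetical method: subset as usual, then put the class back
`[.myClass` <- function(x, i) {
  structure(NextMethod("["), class = class(x))
}

y <- x[c(3, 2, 4)]
inherits(y, "myClass")    # TRUE: the class survived subsetting
unclass(y)                # 3 2 4

.subset(x, c(3, 2, 4))    # bypasses the method entirely
```

Delegating with NextMethod() keeps all the usual index types (negative, logical, names) working without reimplementing them.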


Section 39.7: Elementwise Matrix Operations 


Let A and B be two matrices of same dimension. The operators +, -, /, * and ^, when used with matrices of same
dimension, perform the required operations on the corresponding elements of the matrices and return a new
matrix of the same dimension. These operations are usually referred to as element-wise operations.


Operator  A op B   Meaning
+         A + B    Addition of corresponding elements of A and B
-         A - B    Subtracts the elements of B from the corresponding elements of A
/         A / B    Divides the elements of A by the corresponding elements of B
*         A * B    Multiplies the elements of A by the corresponding elements of B
^         A^(-1)   For example, gives a matrix whose elements are reciprocals of A


For "true" matrix multiplication, as seen in Linear Algebra, use %*%. For example, multiplication of A with B is: A %*%
B. The dimensional requirements are that the ncol() of A be the same as nrow() of B.


Some Functions used with Matrices 


Function     Example       Purpose
nrow()       nrow(A)       determines the number of rows of A
ncol()       ncol(A)       determines the number of columns of A
rownames()   rownames(A)   returns the row names of the matrix A
colnames()   colnames(A)   returns the column names of the matrix A
rowMeans()   rowMeans(A)   computes means of each row of the matrix A
colMeans()   colMeans(A)   computes means of each column of the matrix A
upper.tri()  upper.tri(A)  returns a logical matrix that is TRUE in the upper triangle of A
lower.tri()  lower.tri(A)  returns a logical matrix that is TRUE in the lower triangle of A
det()        det(A)        results in the determinant of the matrix A
solve()      solve(A)      results in the inverse of the non-singular matrix A
diag()       diag(A)       returns the diagonal elements of the matrix A as a vector
t()          t(A)          returns the transpose of the matrix A
eigen()      eigen(A)      returns the eigenvalues and eigenvectors of the matrix A
is.matrix()  is.matrix(A)  returns TRUE or FALSE depending on whether A is a matrix or not.
as.matrix()  as.matrix(x)  creates a matrix out of the vector x
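

The contrast between element-wise and "true" matrix operations can be checked on a small example. The two 2x2 matrices below are arbitrary illustrative values.

```r
A <- matrix(1:4, nrow = 2)   # filled column by column: rows are (1, 3) and (2, 4)
B <- matrix(5:8, nrow = 2)   # rows are (5, 7) and (6, 8)

A * B                        # element-wise product
A %*% B                      # matrix product
solve(A) %*% A               # identity matrix, up to rounding error

diag(A)                      # diagonal of A as a vector: 1 4
A[upper.tri(A)]              # elements above the diagonal: 3
```

Note how upper.tri() is used as a logical index into A; on its own it returns only TRUE/FALSE positions, not the values.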
Chapter 40: Debugging
Section 40.1: Using debug 


You can set any function for debugging with debug. 


debug(mean)
mean(1:3) 


All subsequent calls to the function will enter debugging mode. You can disable this behavior with undebug. 


undebug(mean)
mean(1:3) 


If you know you only want to enter the debugging mode of a function once, consider the use of debugonce. 


debugonce(mean) 
mean(1:3) 
mean(1:3) 
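

Whether a function is currently flagged for debugging can be queried with isdebugged(), which is handy in scripts because it does not open the debugger itself. The function f below is a made-up example.

```r
f <- function(x) x + 1      # a hypothetical function to instrument

debug(f)
flag_on <- isdebugged(f)    # TRUE: the next call to f() would open the debugger

undebug(f)
flag_off <- isdebugged(f)   # FALSE: the flag has been cleared

f(1)                        # runs normally again and returns 2
```

Checking the flag before calling undebug() avoids the warning R issues when a function was not being debugged in the first place.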


Section 40.2: Using browser 


The browser function can be used like a breakpoint: code execution will pause at the point it is called. The user can
then inspect variable values, execute arbitrary R code and step through the code line by line.


Once browser() is hit in the code the interactive interpreter will start. Any R code can be run as normal, and in
addition the following commands are present,


Command  Meaning
c        Exit browser and continue program
f        Finish current loop or function
n        Step Over (evaluate next statement, stepping over function calls)
s        Step Into (evaluate next statement, stepping into function calls)
where    Print stack trace
r        Invoke "resume" restart
Q        Exit browser and quit


For example we might have a script like, 


toDebug <- function() { 
    a = 1 
    b = 2 

    browser() 

    for(i in 1:100) { 
        a = a * b 
    } 
} 


toDebug() 


When running the above script we initially see something like, 


Called from: toDebug() 
Browse[1]> 


We could then interact with the prompt as so, 


Called from: toDebug() 
Browse[1]> a 
[1] 1 
Browse[1]> b 
[1] 2 
Browse[1]> n 
debug at #7: for (i in 1:100) { 
    a = a * b 
} 
Browse[2]> n 
debug at #8: a = a * b 
Browse[2]> a 
[1] 1 
Browse[2]> n 
debug at #8: a = a * b 
Browse[2]> a 
[1] 2 
Browse[2]> Q 
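
When a loop runs many times, stepping with n quickly becomes tedious. browser() accepts an expr argument and only pauses when it evaluates to TRUE, so a breakpoint can be made conditional. The sketch below is a variant of toDebug written for illustration only: the function name toDebugAt and the iteration number 50 are assumptions, and the interactive() guard keeps it from pausing when run as a script.

```r
# toDebugAt and stop_at are hypothetical names introduced for this sketch
toDebugAt <- function(stop_at = 50) {
  a <- 1
  b <- 2
  for (i in 1:100) {
    # pause only at one chosen iteration, and only in an interactive session
    browser(expr = interactive() && i == stop_at)
    a <- a * b
  }
  a
}

final_value <- toDebugAt()
```

In an interactive session the prompt appears once, at iteration 50, where a and i can be inspected before continuing with c.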


browser() can also be used as part of a functional chain, like so: 


mtcars %>% group_by(cyl) %>% {browser()} 


Chapter 41: Installing packages 


Parameter    Details 
pkgs         character vector of the names of packages. If repos = NULL, a character vector of file paths. 
lib          character vector giving the library directories where to install the packages. 
repos        character vector, the base URL(s) of the repositories to use, can be NULL to install from local files 
method       download method 
destdir      directory where downloaded packages are stored 
dependencies logical indicating whether to also install uninstalled packages which these packages depend on/link 
             to/import/suggest (and so on recursively). Not used if repos = NULL. 
...          Arguments to be passed to 'download.file' or to the functions for binary installs on OS X and 
             Windows. 


Section 41.1: Install packages from GitHub 


To install packages directly from GitHub use the devtools package: 


library(devtools) 
install_github("authorName/repositoryName") 


To install ggplot2 from GitHub: 

devtools::install_github("tidyverse/ggplot2") 


The above command will install the version of ggplot2 that corresponds to the master branch. To install from a 
different branch of a repository use the ref argument to provide the name of the branch. For example, the 
following command will install the dev_general branch of the googleway package. 


devtools::install_github("SymbolixAU/googleway", ref = "dev_general") 


Another option is to use the ghit package. It provides a lightweight alternative for installing packages from GitHub: 


install.packages("ghit") 
ghit::install_github("google/CausalImpact") 


To install a package that is in a private repository on GitHub, generate a personal access token at 
http://www.github.com/settings/tokens/ (see ?install_github for documentation on the same). Follow these steps: 


1. install.packages(c("curl", "httr")) 

2. config = httr::config(ssl_verifypeer = FALSE) 

3. install.packages("RCurl") 
   options(RCurlOptions = c(getOption("RCurlOptions"), ssl.verifypeer = FALSE, ssl.verifyhost = FALSE)) 

4. getOption("RCurlOptions") 

   You should see the following: 

   ssl.verifypeer ssl.verifyhost 
            FALSE          FALSE 

5. library(httr) 
   set_config(config(ssl_verifypeer = 0L)) 


This prevents the common error: "Peer certificate cannot be authenticated with given CA certificates" 


6. Finally, use the following command to install your package seamlessly 


install_github("username/package_name", auth_token = "abc") 


Alternatively, set an environment variable GITHUB_PAT, using 


Sys.setenv(GITHUB_PAT = "access_token") 
devtools::install_github("organisation/package_name") 


The PAT generated in GitHub is only visible once, i.e., when created initially, so it's prudent to save that token in 
.Rprofile. This is also helpful if the organisation has many private repositories. 
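
Once the token is stored in an environment variable, code can read it back instead of hard-coding it. The sketch below is one possible pattern rather than anything devtools prescribes: get_github_pat is a hypothetical helper that fails early with a readable message when GITHUB_PAT is missing.

```r
# get_github_pat is a hypothetical helper, not part of devtools
get_github_pat <- function() {
  pat <- Sys.getenv("GITHUB_PAT", unset = "")
  if (!nzchar(pat)) {
    stop("GITHUB_PAT is not set; define it in .Rprofile or .Renviron")
  }
  pat
}

# Usage (not run here, since it needs network access and a real token):
# devtools::install_github("organisation/package_name",
#                          auth_token = get_github_pat())
```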


Section 41.2: Download and install packages from repositories 


Packages are collections of R functions, data, and compiled code in a well-defined format. Public (and private) 
repositories are used to host collections of R packages. The largest collection of R packages is available from CRAN. 


Using CRAN 

A package can be installed from CRAN using the following code: 

install.packages("dplyr") 

Here "dplyr" is passed as a character vector of length one. 


More than one package can be installed in one go by using the combine function c() to pass a character vector of 
package names: 


install.packages(c("dplyr", "tidyr", "ggplot2")) 


In some cases, install.packages may prompt for a CRAN mirror or fail, depending on the value of 
getOption("repos"). To prevent this, specify a CRAN mirror as the repos argument: 


install.packages("dplyr", repos = "https://cloud.r-project.org/") 


Using the repos argument it is also possible to install from other repositories. For complete information about all 
the available options, run ?install.packages. 


Most packages rely on functions implemented in other packages (e.g. the package data.table). To install a package 
(or multiple packages) together with all the packages it uses, set the argument dependencies to TRUE: 


install.packages("data.table", dependencies = TRUE) 
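
Scripts that are run repeatedly often install only what is missing. A minimal sketch of that idea using requireNamespace(), under the assumption that the package names in wanted are placeholders (the last one is deliberately fictitious so the example has something to report):

```r
# Placeholder package names; the last one is assumed not to exist
wanted <- c("stats", "utils", "aPackageThatDoesNotExist")

# requireNamespace() returns FALSE instead of raising an error when a package is absent
is_installed <- vapply(wanted, requireNamespace, logical(1),
                       quietly = TRUE, USE.NAMES = FALSE)
to_install <- wanted[!is_installed]

# Not run here (needs network access):
# if (length(to_install) > 0) {
#   install.packages(to_install, repos = "https://cloud.r-project.org/",
#                    dependencies = TRUE)
# }
```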


Using Bioconductor 


Bioconductor hosts a substantial collection of packages related to bioinformatics. It provides its own package 
management centred around the biocLite function (since Bioconductor 3.8, biocLite has been superseded by 
BiocManager::install() from the BiocManager package): 

## Try http:// if https:// URLs are not supported 
source("https://bioconductor.org/biocLite.R") 
biocLite() 


By default this installs a subset of packages that provide the most commonly used functionality. Specific packages 
can be installed by passing a vector of package names. For example, to install RImmPort from Bioconductor: 


source("https://bioconductor.org/biocLite.R") 
biocLite("RImmPort") 


Section 41.3: Install package from local source 

To install a package from a local source file: 

install.packages(path_to_source, repos = NULL, type = "source") 
install.packages("~/Downloads/dplyr-master.zip", repos = NULL, type = "source") 

Here, path_to_source is the absolute path of the local source file. 

Another command, which opens a window to choose a downloaded zip or tar.gz source file, is: 


install.packages(file.choose(), repos=NULL) 


Another possible way is using the GUI-based RStudio: 

Step 1: Go to Tools. 

Step 2: Go to Install Packages. 

Step 3: Set Install from to Package Archive File (.zip; .tar.gz). 

Step 4: Click Browse, find your package file (say crayon_1.3.1.zip), wait until the package path and file name 
appear in the Package archive field, and then click Install. 


Another way to install an R package from a local source is to use the install_local() function from the devtools package. 


library(devtools) 
install_local("~/Downloads/dplyr-master.zip") 


Section 41.4: Install local development version of a package 


While working on the development of an R package it is often necessary to install the latest version of the package. 
This can be achieved by first building a source distribution of the package (on the command line) 


R CMD build my_package 
and then installing it in R. Any running R sessions with a previous version of the package loaded will need to reload it. 


unloadNamespace("my_package") 
library(my_package) 


A more convenient approach uses the devtools package to simplify the process. In an R session with the working 
directory set to the package directory 




devtools::install() 


will build, install and reload the package. 


Section 41.5: Using a CLI package manager -- basic pacman usage 


pacman is a simple package manager for R. 


pacman allows a user to compactly load all desired packages, installing any which are missing (and their 
dependencies), with a single command, p_load. pacman does not require the user to type quotation marks around a 
package name. Basic usage is as follows: 


p_load(data.table, dplyr, ggplot2) 
The only package requiring a library, require, or install.packages statement with this approach is pacman itself: 


library(pacman) 
p_load(data.table, dplyr, ggplot2) 


or, equally valid: 
pacman::p_load(data.table, dplyr, ggplot2) 


In addition to saving time by requiring less code to manage packages, pacman also facilitates the construction of 
reproducible code by installing any needed packages if and only if they are not already installed. 


Since you may not be sure whether pacman is installed in the library of a user who will run your code (or in your own 
library in the future), a best practice is to include a conditional statement that installs pacman if it is not 
already available: 


if (!require(pacman)) install.packages("pacman") 
pacman::p_load(data.table, dplyr, ggplot2) 


Chapter 42: Inspecting packages 


Packages build on base R. This document explains how to inspect installed packages and their functionality. Related 
Docs: Installing packages 


Section 42.1: View Package Version 
Conditions: the package must be installed; it does not need to be loaded in the current session. 


## Checking the version of a package that was installed in the past or 
## is installed currently but not loaded in the current session 

packageVersion("seqinr") 
packageVersion("RWeka") 
## each call prints the installed version of the named package 


Section 42.2: View Loaded packages in Current Session 
To check the list of loaded packages 

search() 

OR 


(.packages()) 


Section 42.3: View package information 

To retrieve information about the dplyr package and its functions' descriptions: 

help(package = "dplyr") 

No need to load the package first. 


Section 42.4: View package's built-in data sets 

To see built-in data sets from package dplyr: 

data(package = "dplyr") 

No need to load the package first. 


Section 42.5: List a package's exported functions 

To get the list of functions within package dplyr, we first must load the package: 


library(dplyr) 
ls("package:dplyr") 
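
The same inspection calls can be tried on a package that ships with R, so nothing needs to be installed first. The sketch below uses stats as a stand-in for dplyr (an assumption made only so the example runs anywhere) and also shows getNamespaceExports(), which lists exports without requiring the package to be attached.

```r
# "stats" ships with R, so it is always installed
stats_version <- packageVersion("stats")        # a package_version object
exported      <- getNamespaceExports("stats")   # works without attaching the package
attached      <- ls("package:stats")            # requires the package to be attached
has_median    <- "median" %in% exported
```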
Chapter 43: Creating packages with devtools 


This topic will cover the creation of R packages from scratch with the devtools package. 


Section 43.1: Creating and distributing packages 


This is a compact guide to quickly creating an R package from your code. Exhaustive documentation will 
be linked when available and should be read if you want deeper knowledge of the subject. See Remarks for more 
resources. 


The directory where your code stands will be referred to as ./, and all the commands are meant to be executed from 
an R prompt in this folder. 


Creation of the documentation 
The documentation for your code has to be in a format which is very similar to LaTeX. 


However, we will use a tool named roxygen in order to simplify the process: 


install.packages("devtools") 
library("devtools") 
install.packages("roxygen2") 
library("roxygen2") 


The full man page for roxygen is available here. It is very similar to doxygen. 


Here is a practical sample about how to document a function with roxygen: 


#' Increment a variable. 
#' 
#' Note that the behavior of this function 
#' is undefined if `x` is not of class `numeric`. 
#' 
#' @export 
#' @author another guy 
#' @name Increment Function 
#' @title increment 
#' 
#' @param x Variable to increment 
#' @return `x` incremented by 1 
#' 
#' @seealso `other_function` 
#' 
#' @examples 
#' increment(3) 
#' # returns 4 
increment <- function(x) { 
  return (x+1) 
} 


And here will be the result. 


It is also recommended to create a vignette (see the topic Creating vignettes), which is a full guide about your 
package. 


Construction of the package skeleton 


Assuming that your code is written for instance in files ./script1.R and ./script2.R, launch the following 
command in order to create the file tree of your package: 


package.skeleton(name="MyPackage", code_files=c("script1.R", "script2.R")) 


Then delete all the files in ./MyPackage/man/. You now have to compile the documentation: 

roxygenize("MyPackage") 


You should also generate a reference manual from your documentation using R CMD Rd2pdf MyPackage from a 
command prompt started in ./. 


Editing the package properties 
1. Package description 


Modify ./MyPackage/DESCRIPTION according to your needs. The fields Package, Version, License, Description, 
Title, Author and Maintainer are mandatory; the others are optional. 


If your package depends on other packages, specify them in a field named Depends (the listed packages are attached 
along with yours) or Imports (only their namespaces are loaded, which is usually preferable). 
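
A quick way to confirm that none of the mandatory fields was forgotten is to read the DESCRIPTION file back with read.dcf(). The sketch below writes a throwaway file to a temporary directory instead of touching ./MyPackage/; every field value in it is a placeholder.

```r
# All field values below are placeholders for illustration
desc <- data.frame(
  Package     = "MyPackage",
  Version     = "0.1.0",
  Title       = "What the Package Does",
  Description = "A longer description of what the package does.",
  Author      = "Your Name",
  Maintainer  = "Your Name <you@example.com>",
  License     = "GPL-3",
  stringsAsFactors = FALSE
)

# Write the fields in DESCRIPTION (DCF) format to a temporary location
desc_path <- file.path(tempdir(), "DESCRIPTION")
write.dcf(desc, file = desc_path)

# Read the file back and compare its fields with the mandatory ones
mandatory <- c("Package", "Version", "License", "Description",
               "Title", "Author", "Maintainer")
present        <- colnames(read.dcf(desc_path))
missing_fields <- setdiff(mandatory, present)
```

For a real package, pointing read.dcf() at ./MyPackage/DESCRIPTION and checking that missing_fields is empty gives the same test.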


2. Optional folders 


Once you launched the skeleton build, ./MyPackage/ only had R/ and man/ subfolders. However, it can have some 
others: 


• data/: here you can place the data that your library needs and that isn't code. It must be saved as a dataset 
  with the .RData extension, and you can load it at runtime with data() and load(). 
• tests/: all the code files in this folder will be run at install time. If there is any error, the installation will fail. 
• src/: for C/C++/Fortran source files you need (using Rcpp...). 
• exec/: for other executables. 
• misc/: for nearly everything else. 


Finalization and build 

You can delete ./MyPackage/Read-and-delete-me. 

As it is now, your package is ready to be installed. 

You can install it with devtools::install("MyPackage"). 


To build your package as a source tarball, you need to execute the following command, from a command prompt in 
./: R CMD build MyPackage 


Distribution of your package 
Through Github 


Simply create a new repository called MyPackage and upload everything in MyPackage/ to the master branch. Here 
is an example. 

Then anyone can install your package from GitHub with devtools: 


devtools::install_github("your_github_username/MyPackage") 


Through CRAN 


Your package needs to comply with the CRAN Repository Policy. This includes, but is not limited to, the following: 
your package must be cross-platform (except in some very special cases) and it should pass the R CMD check test. 


Here is the submission form. You must upload the source tarball. 


Section 43.2: Creating vignettes 


A vignette is a long-form guide to your package. Function documentation is great if you know the name of 
the function you need, but it’s useless otherwise. A vignette is like a book chapter or an academic paper: 
it can describe the problem that your package is designed to solve, and then show the reader how to 
solve it. 


Vignettes will be created entirely in markdown. 


Requirements 


• Rmarkdown: install.packages("rmarkdown") 
• Pandoc 


Vignette creation 


devtools::use_vignette("MyVignette", "MyPackage") 


You can now edit your vignette at ./vignettes/MyVignette.Rmd. 
The text in your vignette is formatted as Markdown. 


The only addition to the original Markdown is a tag that takes R code, runs it, captures the output, and translates it 
into formatted Markdown: 


    ```{r} 
    # Add two numbers together 
    add <- function(a, b) a + b 
    add(10, 20) 
    ``` 


Will display as: 


    # Add two numbers together 
    add <- function(a, b) a + b 
    add(10, 20) 
    ## [1] 30 


Thus, all the packages you use in your vignettes must be listed as dependencies in ./DESCRIPTION. 


Chapter 44: Using pipe assignment in your own package %<>%: How to ? 


In order to use the pipe in a user-created package, it must be listed in the NAMESPACE like any other function you 
choose to import. 


Section 44.1: Putting the pipe in a utility-functions file 


One option for doing this is to export the pipe from within the package itself. This may be done in the ‘traditional’ 
zzz.R or utils.R files that many packages utilise for useful little functions that are not exported as part of the 
package. For example, putting: 


#' Pipe operator 
#' 
#' @name %>% 
#' @rdname pipe 
#' @keywords internal 
#' @export 
#' @importFrom magrittr %>% 
#' @usage lhs \%>\% rhs 
NULL 




Chapter 45: Arima Models 


Section 45.1: Modeling an AR1 Process with Arima 
We will model the process 

$x_t = 0.7\,x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim N(0, 1)$ 


#Load the forecast package 
library(forecast) 


#Generate an AR1 process of length n (from Cowpertwait & Metcalfe) 
# Set up variables 
set.seed(1234) 
n <- 1000 
x <- matrix(0,1000,1) 
w <- rnorm(n) 

# loop to create x 
for (t in 2:n) x[t] <- 0.7 * x[t-1] + w[t] 
plot(x, type='l') 


[Figure: line plot of the simulated series x against Index] 


We will fit an Arima model with autoregressive order 1, 0 degrees of differencing, and an MA order of 0. 


#Fit an AR1 model using Arima 
fit <- Arima(x, order = c(1, 0, 0)) 


summary(fit) 
# Series: x 
# ARIMA(1,0,0) with non-zero mean 
# 
# Coefficients: 
#          ar1  intercept 
#       0.7040    -0.0842 
# s.e.  0.0224     0.1062 
# 
# sigma^2 estimated as 0.9923:  log likelihood=-1415.39 
# AIC=2836.79   AICc=2836.81   BIC=2851.51 
# 
# Training set error measures: 
#                         ME      RMSE       MAE MPE MAPE    MASE       ACF1 
# Training set -8.369365e-05 0.9961194 0.7835914 Inf  Inf 0.91488 0.02263595 


Notice that our coefficient is close to the true value from the generated data: 


#Verify that the model captured the true AR parameter 
fit$coef[1] 
#       ar1 
# 0.7040085 
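
A comparable fit can be reproduced without the forecast package, using arima.sim() and arima() from the stats package that ships with R. This is only a sketch under the assumption that base R is an acceptable substitute here; the object names (x_sim, fit_base and so on) are made up, and the simulated series differs from x above, so the estimates will not match the output shown. For an AR(1) model with mean mu and coefficient phi, the h-step forecast is mu + phi^h * (x_n - mu); the last lines compute that by hand next to predict().

```r
set.seed(1234)

# Simulate an AR(1) series with coefficient 0.7 and standard normal innovations
x_sim <- arima.sim(model = list(ar = 0.7), n = 1000)

# Fit ARIMA(1,0,0); the "intercept" reported by arima() is the series mean
fit_base <- arima(x_sim, order = c(1, 0, 0))
phi_hat  <- unname(coef(fit_base)["ar1"])
mu_hat   <- unname(coef(fit_base)["intercept"])

# Five-step-ahead forecasts from the fitted model
pred <- as.numeric(predict(fit_base, n.ahead = 5)$pred)

# Closed-form AR(1) forecast: mu + phi^h * (x_n - mu)
x_last  <- as.numeric(x_sim[length(x_sim)])
by_hand <- mu_hat + phi_hat^(1:5) * (x_last - mu_hat)
```

Because the forecasts shrink geometrically toward the estimated mean, long-horizon point forecasts of a stationary AR(1) process flatten out quickly.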


#Verify that the model eliminates the autocorrelation 
acf(x) 


[Figure: ACF plot of the series x] 


acf(fit$resid) 


[Figure: ACF plot, "Series fit$resid"] 


#Forecast 100 periods 
fcst <- forecast(fit, h = 100) 
fcst 
     Point Forecast      Lo 80    Hi 80     Lo 95    Hi 95 
1001    0.282529070 -0.9940493 1.559107 -1.669829 2.234887 
1002    0.173976408 -1.3872262 1.735179 -2.213677 2.561630 
1003    0.097554408 -1.5869850 1.782094 -2.478726 2.673835 
1004    0.043752667 -1.6986831 1.786188 -2.621073 2.708578 
1005    0.005875783 -1.7645535 1.776305 -2.701762 2.713514 


#Call the point predictions 
fcst$mean 
# Time Series: 
# Start = 1001 
# End = 1100 
# Frequency = 1 
[1] 0.282529070 0.173976408 0.097554408 0.043752667 0.005875783 -0.020789866 -0.039562711 
-0.052778954 
[9] -0.062083302 

#Plot the forecast 
plot(fcst) 


[Figure: "Forecasts from ARIMA(1,0,0) with non-zero mean"] 


Chapter 46: Distribution Functions 


R has many built-in functions to work with probability distributions, with official docs starting at ?Distributions. 


Section 46.1: Normal distribution 


Let's use *norm as an example. From the documentation: 


dnorm(x, mean = 0, sd = 1, log = FALSE) 
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) 
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) 
rnorm(n, mean = 0, sd = 1) 


So if I wanted to know the value of the standard normal density at 0, I would do 

dnorm(0) 


Which gives us 0.3989423, a reasonable answer. 

In the same way pnorm(0) gives .5. Again, this makes sense, because half of the distribution is to the left of 0. 

qnorm will essentially do the opposite of pnorm. qnorm(.5) gives 0. 


Finally, there's the rnorm function: 
rnorm(10) 


Will generate 10 samples from standard normal. 


If you want to change the parameters of a given distribution, simply change them like so 


rnorm(10, mean=4, sd=3) 
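
The relationships between the four functions can be checked directly. A short sketch (the quantiles in q and the seed are arbitrary choices):

```r
# Density at the mode of the standard normal equals 1 / sqrt(2 * pi)
d0 <- dnorm(0)
d0_closed_form <- 1 / sqrt(2 * pi)

# qnorm() inverts pnorm(): going there and back returns the starting quantiles
q <- c(-1.96, 0, 1.5)
round_trip <- qnorm(pnorm(q))

# With non-default parameters, half of the probability still lies below the mean
p_at_mean <- pnorm(4, mean = 4, sd = 3)

# Random draws are reproducible once a seed is set
set.seed(42)
draws <- rnorm(10, mean = 4, sd = 3)
```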


Section 46.2: Binomial Distribution 
We now illustrate the functions dbinom, pbinom, qbinom and rbinom defined for the binomial distribution. 


The dbinom() function gives the probabilities for various values of the binomial variable. Minimally it requires three 
arguments. The first argument for this function must be a vector of quantiles (the possible values of the random 
variable X). The second and third arguments are the defining parameters of the distribution, namely, n (the 
number of independent trials) and p (the probability of success in each trial). For example, for a binomial 
distribution with n = 5, p = 0.5, the possible values for X are 0,1,2,3,4,5. That is, the dbinom(x,n,p) function 
gives the probability values P( X = x ) for x = 0, 1, 2, 3, 4, 5. 


#Binom(n = 5, p = 0.5) probabilities 
> n <- 5; p <- 0.5; x <- 0:n 
> dbinom(x,n,p) 
[1] 0.03125 0.15625 0.31250 0.31250 0.15625 0.03125 
#To verify the total probability is 1 
> sum(dbinom(x,n,p)) 
[1] 1 


The binomial probability distribution plot can be displayed as in the following figure: 

> x <- 0:12 
> prob <- dbinom(x, 12, .5) 
> barplot(prob,col = "red",ylim = c(0,.2),names.arg=x, 
    main="Binomial Distribution\n(n=12, p=0.5)") 


[Figure: bar plot "Binomial Distribution (n=12, p=0.5)", x from 0 to 12, vertical axis from 0.00 to 0.20] 


Note that the binomial distribution is symmetric when p = 0.5. To demonstrate that the binomial distribution is 
negatively skewed when p is larger than 0.5, consider the following example: 


> n=9; p=.7; x=0:n; prob=dbinom(x,n,p); 
> barplot(prob,names.arg = x,main="Binomial Distribution\n(n=9, p=0.7)",col="lightblue") 


[Figure: bar plot "Binomial Distribution (n=9, p=0.7)"] 


When p is smaller than 0.5 the binomial distribution is positively skewed as shown below. 

> n=9; p=.3; x=0:n; prob=dbinom(x,n,p); 
> barplot(prob,names.arg = x,main="Binomial Distribution\n(n=9, p=0.3)",col="cyan") 


[Figure: bar plot "Binomial Distribution (n=9, p=0.3)"] 
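
The direction of the skew can also be checked numerically. For a binomial distribution the skewness is (1 - 2p) / sqrt(n p (1 - p)), so its sign depends only on whether p lies below or above 0.5. The sketch below (binom_skew and binom_skew_pmf are hypothetical helper names) compares that closed form with the third standardised moment computed from dbinom().

```r
# Closed-form skewness of Binomial(n, p)
binom_skew <- function(n, p) (1 - 2 * p) / sqrt(n * p * (1 - p))

# The same quantity computed from the probability mass function
binom_skew_pmf <- function(n, p) {
  x    <- 0:n
  prob <- dbinom(x, n, p)
  mu   <- sum(x * prob)
  sdev <- sqrt(sum((x - mu)^2 * prob))
  sum(((x - mu) / sdev)^3 * prob)
}

skew_p07 <- binom_skew(9, 0.7)    # negative: longer tail to the left
skew_p03 <- binom_skew(9, 0.3)    # positive: longer tail to the right
skew_p05 <- binom_skew(12, 0.5)   # zero: symmetric
```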


We will now illustrate the usage of the cumulative distribution function pbinom(). This function can be used to 
calculate probabilities such as P( X <= x ). The first argument to this function is a vector of quantiles (values of x). 


# Calculating Probabilities 
# P(X <= 2) in a Bin(n=5,p=0.5) distribution 
> pbinom(2,5,0.5) 
[1] 0.5 


The above probability can also be obtained as follows: 

# P(X <= 2) = P(X=0) + P(X=1) + P(X=2) 
> sum(dbinom(0:2,5,0.5)) 
[1] 0.5 

To compute probabilities of the type P( a <= X <= b ): 

# P(3 <= X <= 5) = P(X=3) + P(X=4) + P(X=5) in a Bin(n=9,p=0.6) dist 
> sum(dbinom(c(3,4,5),9,0.6)) 
[1] 0.4923556 
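
An interval probability can equally be written as a difference of two cumulative probabilities, which avoids summing many terms when the interval is wide. A brief sketch using the same Bin(n=9, p=0.6) distribution (the variable names are chosen for illustration):

```r
n <- 9; p <- 0.6

# P(3 <= X <= 5) by summing the probability mass function
by_sum <- sum(dbinom(3:5, n, p))

# The same probability as a difference of cumulative probabilities:
# P(X <= 5) - P(X <= 2)
by_cdf <- pbinom(5, n, p) - pbinom(2, n, p)

# The CDF is the running total of the PMF
cdf_matches <- isTRUE(all.equal(cumsum(dbinom(0:n, n, p)), pbinom(0:n, n, p)))
```

Note that the lower bound enters as a - 1 (here 2), because pbinom() includes its argument in the cumulative probability.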


> 


Presenting the binomial distribution in the form of a table: 


> n = 10; p = 0.4; x = 0:n; 
> prob = dbinom(x,n,p) 
> cdf = pbinom(x,n,p) 
> distTable = cbind(x,prob,cdf) 
> distTable 
       x         prob         cdf 
 [1,]  0 0.0060466176 0.006046618 
 [2,]  1 0.0403107840 0.046357402 
 [3,]  2 0.1209323520 0.167289754 
 [4,]  3 0.2149908480 0.382280602 
 [5,]  4 0.2508226560 0.633103258 
 [6,]  5 0.2006581248 0.833761382 
 [7,]  6 0.1114767360 0.945238118 
 [8,]  7 0.0424673280 0.987705446 
 [9,]  8 0.0106168320 0.998322278 
[10,]  9 0.0015728640 0.999895142 
[11,] 10 0.0001048576 1.000000000 


The rbinom() function is used to generate random samples of a specified size with given parameter values. 


# Simulation 
# (draw the sample once so that the bar heights and their labels come from the same data) 
> sims <- table(rbinom(1000,8,.5)) 
> barplot(as.vector(sims),names.arg = names(sims), 
    main="Simulated Binomial Distribution\n (n=8, p=0.5)") 


[Figure: bar plot "Simulated Binomial Distribution (n=8, p=0.5)"] 


Chapter 47: Shiny 


Section 47.1: Create an app 


Shiny is an R package developed by RStudio that allows the creation of web pages to interactively display the results 
of an analysis in R. 


There are two simple ways to create a Shiny app: 


• in one .R file, or 
• in two files: ui.R and server.R. 


A Shiny app is divided into two parts: 


• ui: A user interface script, controlling the layout and appearance of the application. 
• server: A server script which contains code to allow the application to react. 


One file 
library(shiny) 


# Create the UI 

ui <- shinyUI(fluidPage( 
  # Application title 
  titlePanel("Hello World!") 
)) 


# Create the server function 
server <- shinyServer(function(input, output) {}) 


# Run the app 

shinyApp(ui = ui, server = server) 

Two files 

Create ui.R file 

library(shiny) 

# Define UI for application 
shinyUI(fluidPage( 
  # Application title 
  titlePanel("Hello World!") 
)) 


Create server.R file 


library(shiny) 


# Define server logic 
shinyServer(function(input, output) {}) 


Section 47.2: Checkbox Group 


Create a group of checkboxes that can be used to toggle multiple choices independently. The server will receive the 
input as a character vector of the selected values. 


library(shiny) 


ui <- fluidPage( 
  checkboxGroupInput("checkGroup1", label = h3("This is a Checkbox group"), 
                     choices = list("1" = 1, "2" = 2, "3" = 3), 
                     selected = 1), 
  fluidRow(column(3, verbatimTextOutput("text_choice"))) 
) 


server <- function(input, output) { 
  output$text_choice <- renderPrint({ 
    return(paste0("You have chosen the choice ", input$checkGroup1))}) 
} 


shinyApp(ui = ui, server = server) 


[Screenshot: checkbox group titled "This is a Checkbox group" with choices 1, 2 and 3; choice 1 is ticked] 

[1] "You have chosen the choice 1" 
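
Because input$checkGroup1 is a character vector, ticking several boxes makes paste0() return one string per selected value. The base R sketch below imitates that outside Shiny, with selected standing in for input$checkGroup1 (an assumed value for illustration), and shows one way to collapse the selection into a single message.

```r
# Stand-in for input$checkGroup1 when boxes 1 and 3 are ticked (assumed value)
selected <- c("1", "3")

# paste0() is vectorised: one message per selected value
per_value <- paste0("You have chosen the choice ", selected)

# Collapsing first yields a single message for the whole selection
one_line <- paste0("You have chosen: ", paste(selected, collapse = ", "))
```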


It's possible to change the settings: 


• label: title 
• choices: the values to choose from 
• selected: the initially selected value (NULL for no selection) 
• inline: horizontal or vertical 
• width 


It is also possible to add HTML. 


Section 47.3: Radio Button 


You can create a set of radio buttons used to select an item from a list. 
It's possible to change the settings: 


• selected: the initially selected value (character(0) for no selection) 
• inline: horizontal or vertical 
• width 


It is also possible to add HTML. 


library(shiny) 


ui <- fluidPage( 
  radioButtons("radio", 
               label = HTML('<FONT color="red"><FONT size="5pt">Welcome</FONT></FONT><br> <b>Your favorite color is red ?</b>'), 
               choices = list("TRUE" = 1, "FALSE" = 2), 
               selected = 1, 
               inline = T, 
               width = "100%"), 
  fluidRow(column(3, textOutput("value")))) 


server <- function(input, output) { 
  output$value <- renderPrint({ 
    if(input$radio == 1){return('Great !')} 
    else{return("Sorry !")}})} 


shinyApp(ui = ui, server = server) 


[Screenshot: heading "Welcome", the question "Your favorite color is red ?" and radio buttons TRUE (selected) and FALSE] 

[1] "Great !" 


Section 47.4: Debugging 


debug() and debugonce() won't work well in the context of most Shiny debugging. However, browser() statements 
inserted in critical places can give you a lot of insight into how your Shiny code is (not) working. See also: Debugging 
using browser() 


Showcase mode 


Showcase mode displays your app alongside the code that generates it and highlights lines of code in server.R as it 
runs them. 


There are two ways to enable Showcase mode: 


• Launch Shiny app with the argument display.mode = "showcase", e.g., runApp("MyApp", display.mode = 
  "showcase"). 
• Create file called DESCRIPTION in your Shiny app folder and add this line in it: DisplayMode: Showcase. 


Reactive Log Visualizer 


Reactive Log Visualizer provides an interactive browser-based tool for visualizing reactive dependencies and 
execution in your application. To enable Reactive Log Visualizer, execute options(shiny.reactlog=TRUE) in the R 
console and/or add that line of code to your server.R file. To start Reactive Log Visualizer, hit Ctrl+F3 on Windows or 
Command+F3 on Mac when your app is running. Use left and right arrow keys to navigate in Reactive Log Visualizer. 


Section 47.5: Select box 


Create a select list that can be used to choose a single or multiple items from a list of values. 


library(shiny) 


ui <- fluidPage( 
  selectInput("id_selectInput", 
              label = HTML('<B><FONT size="3">What is your favorite color ?</FONT></B>'), 
              multiple = TRUE, 
              choices = list("red" = "red", "green" = "green", "blue" = "blue", "yellow" = "yellow"), 
              selected = NULL), 
  br(), br(), 
  fluidRow(column(3, textOutput("text_choice")))) 


server <- function(input, output) { 
  output$text_choice <- renderPrint({ 
    return(input$id_selectInput)}) 
} 


shinyApp(ui = ui, server = server) 


[Screenshot: select box labelled "What is your favorite color ?" with red, green and blue selected and yellow still available] 

[1] "red" "green" "blue" 


It's possible to change the settings: 


• label: title 
• choices: the values to choose from 
• selected: the initially selected value (NULL for no selection) 
• multiple: TRUE or FALSE 
• width 
• size 
• selectize: TRUE or FALSE (whether or not to use selectize.js, which changes the display) 


It is also possible to add HTML. 


Section 47.6: Launch a Shiny app 


You can launch an application in several ways, depending on how you created your app: either divided into two 
files, ui.R and server.R, or contained entirely in one file. 


1. Two files app 


Your two files ui.R and server.R have to be in the same folder. You can then launch your app by running the 
runApp() function in the console and passing it the path of the directory that contains the Shiny app. 


runApp("path_to_the_folder_containing_the_files") 


You can also launch the app directly from RStudio by pressing the Run App button that appears in RStudio when 
you have a ui.R or server.R file open. 


[Screenshot: RStudio editor toolbar with the Run App button] 


Or you can simply write runApp() in the console if your working directory is the Shiny app directory. 

2. One file app 

If you create your app in one R file you can also launch it with the shinyApp() function. 


• inside of your code: 


library(shiny) 


ui <- fluidPage() #Create the ui 
server <- function(input, output){} #create the server 


shinyApp(ui = ui, server = server) #run the App 


• in the console, by passing the path to a .R file containing the Shiny application to shinyAppFile() through the 
  parameter appFile: 


shinyAppFile(appFile = "path_to_my_R_file_containing_the_app") 


Section 47.7: Control widgets 


Function            Widget 
actionButton        Action Button 
checkboxGroupInput  A group of check boxes 
checkboxInput       A single check box 
dateInput           A calendar to aid date selection 
dateRangeInput      A pair of calendars for selecting a date range 
fileInput           A file upload control wizard 
helpText            Help text that can be added to an input form 
numericInput        A field to enter numbers 
radioButtons        A set of radio buttons 
selectInput         A box with choices to select from 
sliderInput         A slider bar 
submitButton        A submit button 
textInput           A field to enter text 

library(shiny) 


# Create the UI 
ui <- shinyUI(fluidPage( 
  titlePanel("Basic widgets"), 

  fluidRow( 

    column(3, 
      h3("Buttons"), 
      actionButton("action", label = "Action"), 
      br(), 
      br(), 
      submitButton("Submit")), 

    column(3, 
      h3("Single checkbox"), 
      checkboxInput("checkbox", label = "Choice A", value = TRUE)), 

    column(3, 
      checkboxGroupInput("checkGroup", 
        label = h3("Checkbox group"), 
        choices = list("Choice 1" = 1, 
                       "Choice 2" = 2, "Choice 3" = 3), 
        selected = 1)), 

    column(3, 
      dateInput("date", 
        label = h3("Date input"), 
        value = "2014-01-01")) 
  ), 

  fluidRow( 
    column(3, 
      dateRangeInput("dates", label = h3("Date range"))), 
    column(3, 
      fileInput("file", label = h3("File input"))), 
    column(3, 
      h3("Help text"), 
      helpText("Note: help text isn't a true widget,", 
        "but it provides an easy way to add text to", 
        "accompany other widgets.")), 
    column(3, 
      numericInput("num", 
        label = h3("Numeric input"), 
        value = 1)) 
  ), 

  fluidRow( 
    column(3, 
      radioButtons("radio", label = h3("Radio buttons"), 
        choices = list("Choice 1" = 1, "Choice 2" = 2, 
                       "Choice 3" = 3), selected = 1)), 
    column(3, 
      selectInput("select", label = h3("Select box"), 
        choices = list("Choice 1" = 1, "Choice 2" = 2, 
                       "Choice 3" = 3), selected = 1)), 
    column(3, 
      sliderInput("slider1", label = h3("Sliders"), 
        min = 0, max = 100, value = 50), 
      sliderInput("slider2", "", 
        min = 0, max = 100, value = c(25, 75)) 
    ), 
    column(3, 
      textInput("text", label = h3("Text input"), 
        value = "Enter text...")) 
  ) 
)) 


# Create the server function 
server <- shinyServer(function(input, output) {}) 


# Run the app 
shinyApp(ui = ui, server = server) 
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Chapter 48: spatial analysis 


Section 48.1: Create spatial points from XY data set 


When it comes to geographic data, R proves to be a powerful tool for data handling, analysis and visualisation.

Often, spatial data is available as an XY coordinate data set in tabular form. This example shows how to create a
spatial data set from an XY data set.

The packages rgdal and sp provide powerful functions. Spatial data in R can be stored as Spatial*DataFrame
(where * can be Points, Lines or Polygons).

This example uses data which can be downloaded at OpenGeocode.

First, the working directory has to be set to the folder of the downloaded CSV data set. Furthermore, the package
rgdal has to be loaded.


setwd("D:/GeocodeExample/") 
library(rgdal) 


Afterwards, the CSV file storing cities and their geographical coordinates is loaded into R as a data.frame.

xy <- read.csv("worldcities.csv", stringsAsFactors = FALSE)

Often, it is useful to get a glimpse of the data and its structure (e.g. column names, data types etc.).


head(xy) 
str(xy) 


This shows that the latitude and longitude columns are interpreted as character values, since they hold entries like
"-33,532". Yet the function SpatialPointsDataFrame(), which is used later to create the spatial data set, requires the
coordinate values to be of the data type numeric. Thus the two columns have to be converted.

xy$latitude <- as.numeric(xy$latitude)
xy$longitude <- as.numeric(xy$longitude)

A few of the values cannot be converted into numeric data, and thus NA values are created. They have to be removed.

xy <- xy[!is.na(xy$longitude), ]
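
The conversion step above can be tried on a few values in isolation. The following is only a sketch: the strings in lat_chr are made-up stand-ins for the contents of worldcities.csv, and swapping the comma for a point is an assumption about what such entries mean, not something the data set documents.

```r
# Hypothetical coordinate strings; the real values come from worldcities.csv
lat_chr <- c("52.52", "-33,532", "n/a")

# as.numeric() returns NA (with a warning) for anything it cannot parse
lat_num <- suppressWarnings(as.numeric(lat_chr))
n_failed <- sum(is.na(lat_num))   # number of entries that failed to convert

# Assumption: a comma is used as the decimal mark, so swap it for a point first
lat_fixed <- suppressWarnings(as.numeric(sub(",", ".", lat_chr, fixed = TRUE)))
```

Counting the NA values before dropping rows gives a quick check on how much data the clean-up step removes.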


Finally, the XY data set can be converted into a spatial data set. This requires the coordinates and the specification
of the Coordinate Reference System (CRS) in which the coordinates are stored.

xySPoints <- SpatialPointsDataFrame(coords = xy[, c("longitude", "latitude")],
                                    proj4string = CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"),
                                    data = xy)

The basic plot function can be used to take a quick look at the resulting spatial points.

plot(xySPoints, pch = ".")


Section 48.2: Importing a shape file (.shp)
rgdal

ESRI shape files can easily be imported into R by using the function readOGR() from the rgdal package.

library(rgdal)
shp <- readOGR(dsn = "/path/to/your/file", layer = "filename")

It is important to know that the dsn must not end with / and that the layer must be given without the file
extension (e.g. .shp).

raster

Another possible way of importing shapefiles is via the raster library and the shapefile function:

library(raster)
shp <- shapefile("path/to/your/file.shp")

Note how the path definition is different from the rgdal import statement.

tmap

The tmap package provides a nice wrapper for the rgdal::readOGR function.

library(tmap)
shp <- read_shape("path/to/your/file.shp")


Chapter 49: sqldf


Section 49.1: Basic Usage Examples 


sqldf() from the package sqldf allows the use of SQLite queries to select and manipulate data in R. SQL queries
are entered as character strings.

To select the first 10 rows of the "diamonds" dataset from the package ggplot2, for example:

library(ggplot2)
data("diamonds")
head(diamonds)
# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.20  4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

require(sqldf)
sqldf("select * from diamonds limit 10")
   carat       cut color clarity depth table price    x    y
1   0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98
2   0.21   Premium     E     SI1  59.8    61   326 3.89 3.84
3   0.23      Good     E     VS1  56.9    65   327 4.05 4.07
4   0.29   Premium     I     VS2  62.4    58   334 4.20 4.23
5   0.31      Good     J     SI2  63.3    58   335 4.34 4.35
6   0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96
7   0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98
8   0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11
9   0.22      Fair     E     VS2  65.1    61   337 3.87 3.78
10  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05
To select the first 10 rows where the color is "E":

sqldf("select * from diamonds where color = 'E' limit 10")
   carat       cut color clarity depth table price    x    y
1   0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98
2   0.21   Premium     E     SI1  59.8    61   326 3.89 3.84
3   0.23      Good     E     VS1  56.9    65   327 4.05 4.07
4   0.22      Fair     E     VS2  65.1    61   337 3.87 3.78
5   0.20   Premium     E     SI2  60.2    62   345 3.79 3.75
6   0.32   Premium     E      I1  60.9    58   345 4.38 4.42
7   0.23 Very Good     E     VS2  63.8    55   352 3.85 3.92
8   0.23 Very Good     E     VS1  60.7    59   402 3.97 4.01
9   0.23 Very Good     E     VS1  59.5    58   402 4.01 4.06
10  0.23      Good     E     VS1  64.1    59   402 3.83 3.85


Notice in the example above that quoted strings within the SQL query are quoted using '' if the overall query is
quoted with "" (this also works in reverse).

Suppose that we wish to count the number of color "E" diamonds over 1 carat:

sqldf("select count(*) from diamonds where carat > 1 and color = 'E'")
  count(*)
1     1892

Results of created values can also be returned as new columns:

sqldf("select *, count(*) as cnt_big_E_colored_stones from diamonds where carat > 1 and color = 'E' group by clarity")
        cut color clarity ... price ... cnt_big_E_colored_stones
1      Fair     E      I1 ...  2571 ...                       65
2     Ideal     E      IF ... 18700 ...                       28
3 Very Good     E     SI1 ... 18731 ...                      499
4   Premium     E     SI2 ... 18477 ...                      666
5     Ideal     E     VS1 ... 18729 ...                      158
6 Very Good     E     VS2 ... 18557 ...                      318
7     Ideal     E    VVS1 ... 16256 ...                       52
8     Ideal     E    VVS2 ... 18188 ...                      106

If one is interested in the maximum price of the diamonds for each cut:

sqldf("select cut, max(price) from diamonds group by cut")
        cut max(price)
1      Fair      18574
2      Good      18788
3     Ideal      18806
4   Premium      18823
5 Very Good      18818


Chapter 50: Code profiling


Section 50.1: Benchmarking using microbenchmark 


You can use the microbenchmark package to conduct "sub-millisecond accurate timing of expression evaluation".

In this example we are comparing the speeds of six equivalent data.table expressions for updating elements in a
group, based on a certain condition.

More specifically:

A data.table with 3 columns: id, time and status. For each id, I want to find the record with the
maximum time; then, if the status of that record is true, I want to set it to false if the time is > 7.


library(microbenchmark)
library(data.table)

set.seed(20160723)
dt <- data.table(id = c(rep(seq(1:10000), each = 10)),
                 time = c(rep(seq(1:10000), 10)),
                 status = c(sample(c(TRUE, FALSE), 10000*10, replace = TRUE)))
setkey(dt, id, time)

## create copies of the data so the 'updates-by-reference' don't affect other expressions
dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)
dt4 <- copy(dt)
dt5 <- copy(dt)
dt6 <- copy(dt)

microbenchmark(

  expression_1 = {
    dt1[ dt1[order(time), .I[.N], by = id]$V1, status := status * time < 7 ]
  },

  expression_2 = {
    dt2[, status := c(.SD[-.N, status], .SD[.N, status * time > 7]), by = id]
  },

  expression_3 = {
    dt3[dt3[, .N, by = id][, cumsum(N)], status := status * time > 7]
  },

  expression_4 = {
    y <- dt4[, .SD[.N], by = id]
    dt4[y, status := status & time > 7]
  },

  expression_5 = {
    y <- dt5[, .SD[.N, .(time, status)], by = id][time > 7 & status]
    dt5[y, status := FALSE]
  },

  expression_6 = {
    dt6[ dt6[, .I == .I[which.max(time)], by = id]$V1 & time > 7, status := FALSE]
  },

  times = 10L ## specify the number of times each expression is evaluated
)

# Unit: milliseconds
#         expr         min          lq        mean      median         uq neval
# expression_1   11.646149   13.201670   16.808399   15.643384   18.78640    10
# expression_2 8051.898126 8777.016935 9238.323459 8979.553856 9281.93377    10
# expression_3    3.208773    3.385841    4.207903    4.089515    4.70146    10
# expression_4   15.758441   16.247833   20.677038   19.028982   21.04178    10
# expression_5 7552.970295 8051.080753 8702.064620 8861.608629 9308.62842    10
# expression_6   18.403105   18.812785   22.427984   21.966764   24.66938    10

The output shows that in this test expression_3 is the fastest.

References

data.table - Adding and modifying columns
data.table - special grouping symbols in data.table


Section 50.2: proc.time()


At its simplest, proc.time() reports, in seconds, the user and system CPU time consumed by the current R process,
together with the real time elapsed since the process started. Executing it in the console gives the following type
of output:

proc.time()

#    user  system    elapsed
# 284.507 128.397 515029.305


This is particularly useful for benchmarking specific lines of code. For example:

t1 <- proc.time()
fibb <- function (n) {
  if (n < 3) {
    return(c(0, 1)[n])
  } else {
    return(fibb(n - 2) + fibb(n - 1))
  }
}
print("Time one")
print(proc.time() - t1)

t2 <- proc.time()
fibb(30)

print("Time two")
print(proc.time() - t2)

This gives the following output:

source('~/.active-rstudio-document')

# [1] "Time one"
#    user  system elapsed
#       0       0       0

# [1] "Time two"
#    user  system elapsed
#   1.534   0.012   1.572

system.time() is a wrapper for proc.time() that returns the elapsed time for a particular command/expression.

print(t1 <- system.time(replicate(1000, 12^2)))
##    user  system elapsed
##   0.000   0.000   0.002

Note that the returned object, of class proc_time, is slightly more complicated than it appears on the surface:

str(t1)
## Class 'proc_time'  Named num [1:5] 0 0 0.002 0 0
##  ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
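
Because the timing object is a named numeric vector, individual components can be pulled out by name. A minimal sketch using only base R follows; the timed expression is arbitrary and the variable names are chosen for illustration.

```r
# Time a small vectorised computation (the expression itself is arbitrary)
timing <- system.time(sum(sqrt(seq_len(1e6))))

timing_class <- class(timing)   # the class of the returned object
timing_names <- names(timing)   # names of its five components

# Pull out single components by name
elapsed_sec <- timing[["elapsed"]]
cpu_sec <- timing[["user.self"]] + timing[["sys.self"]]
```

Extracting elapsed_sec this way is convenient when timings need to be stored or compared programmatically rather than read off the console.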


Section 50.3: Microbenchmark 


Microbenchmark is useful for estimating the time taken by otherwise fast procedures. For example, consider
estimating the time taken to print hello world.

system.time(print("hello world"))

# [1] "hello world"
#    user  system elapsed
#       0       0       0

This is because system.time is essentially a wrapper function for proc.time, which measures in seconds. As
printing "hello world" takes far less than a second, the time taken is too short to register, so it appears that
no time was taken at all. This is not true. To see this we can use the package microbenchmark:


library(microbenchmark) 
microbenchmark(print("hello world") ) 


# Unit: microseconds 


# expr min lq mean median uq max neval 
# print("hello world") 26.336 29.984 44.11637 44.6835 45.415 158.824 100 


Here we can see that after running print("hello world") 100 times, the average time taken was in fact 44
microseconds. (Note that running this code will print "hello world" 100 times onto the console.)

We can compare this against an equivalent procedure, cat("hello world\n"), to see if it is faster than
print("hello world"):

microbenchmark(cat("hello world\n"))
# Unit: microseconds
#                   expr    min      lq     mean median     uq     max neval
# cat("hello world\\n") 14.093 17.6975 23.73829 19.319 20.996 119.382   100

In this case cat() is almost twice as fast as print().

Alternatively one can compare two procedures within the same microbenchmark call:

microbenchmark(print("hello world"), cat("hello world\n"))
# Unit: microseconds
#                   expr    min     lq     mean  median     uq     max neval
#   print("hello world") 29.122 31.654 39.64255 34.5275 38.852 192.779   100
#  cat("hello world\\n")  9.381 12.356 13.83820 12.9938 13.715  52.564   100


Section 50.4: System.time 


system.time gives you the CPU time required to execute an R expression, for example:

system.time(print("hello world"))

# [1] "hello world"
#    user  system elapsed
#       0       0       0

You can add larger pieces of code through use of braces:

system.time({
  library(numbers)
  Primes(1, 10^5)
})

Or use it to test functions:

fibb <- function (n) {
  if (n < 3) {
    return(c(0, 1)[n])
  } else {
    return(fibb(n - 2) + fibb(n - 1))
  }
}

system.time(fibb(30))


Section 50.5: Line Profiling 


One package for line profiling is lineprof, which is written and maintained by Hadley Wickham. Here is a quick
demonstration of how it works with auto.arima in the forecast package:

library(lineprof)
library(forecast)

l <- lineprof(auto.arima(AirPassengers))
shine(l)

This will provide you with a shiny app, which allows you to delve deeper into every function call. This enables you to
see with ease what is causing your R code to slow down. There is a screenshot of the shiny app below:

[Screenshot: the "Line profiling" shiny app, listing source code lines 1 to 16 next to calls such as nsdiffs/OCSBtest,
ndiffs/diff, diff/diff.ts, try/tryCatch and myarima/suppressWarnings]



Chapter 51: Control flow structures 


Section 51.1: Optimal Construction of a For Loop 


To illustrate the effect of good for loop construction, we will calculate the mean of each column in four different
ways:

1. Using a poorly optimized for loop
2. Using a well optimized for loop
3. Using an *apply family of functions
4. Using the colMeans function

Each of these options will be shown in code; a comparison of the computational time to execute each option will be
shown; and lastly a discussion of the differences will be given.


Poorly optimized for loop 


column_mean_poor <- NULL
for (i in 1:length(mtcars)) {
  column_mean_poor[i] <- mean(mtcars[[i]])
}

Well optimized for loop

column_mean_optimal <- vector("numeric", length(mtcars))
for (i in seq_along(mtcars)) {
  column_mean_optimal[i] <- mean(mtcars[[i]])
}


vapply Function 


column_mean_vapply <- vapply(mtcars, mean, numeric(1) ) 


colMeans Function 


column_mean_colMeans <- colMeans(mtcars) 


Efficiency comparison 


The results of benchmarking these four approaches are shown below (code not displayed):

Unit: microseconds
     expr     min       lq     mean   median       uq     max neval cld
     poor 240.986 262.0820 287.1125 275.8160 307.2485 442.609   100    d
  optimal 220.313 237.4455 258.8426 247.0735 280.9130 362.469   100   c
   vapply 107.042 109.7320 124.4715 113.4130 132.6695 202.473   100 a
 colMeans 155.183 161.6955 180.2067 175.0045 194.2605 259.958   100  b


Notice that the optimized for loop edged out the poorly constructed for loop. The poorly constructed for loop is
constantly increasing the length of the output object, and at each change of the length, R is reevaluating the class of
the object.

Some of this overhead burden is removed by the optimized for loop by declaring the type of output object and its
length before starting the loop.

In this example, however, the use of a vapply function doubles the computational efficiency, largely because we
told R that the result had to be numeric (if any one result were not numeric, an error would be returned).

Use of the colMeans function is a touch slower than the vapply function. This difference is attributable to some
error checks performed in colMeans and mainly to the as.matrix conversion (because mtcars is a data.frame) that
weren't performed in the vapply function.
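
Before comparing speed it is worth confirming that the four approaches agree. The sketch below recomputes the column means of mtcars in each of the four ways and compares them with all.equal(); the unname() calls are needed only because the two loops return unnamed vectors.

```r
# Compute the column means of mtcars in the four ways described above
column_mean_poor <- NULL
for (i in 1:length(mtcars)) {
  column_mean_poor[i] <- mean(mtcars[[i]])      # grows the vector on each pass
}

column_mean_optimal <- vector("numeric", length(mtcars))
for (i in seq_along(mtcars)) {
  column_mean_optimal[i] <- mean(mtcars[[i]])   # fills a preallocated vector
}

column_mean_vapply <- vapply(mtcars, mean, numeric(1))
column_mean_colMeans <- colMeans(mtcars)

# The loops return unnamed vectors, so drop names before comparing
same_loops  <- all.equal(column_mean_poor, column_mean_optimal)
same_vapply <- all.equal(column_mean_optimal, unname(column_mean_vapply))
same_colMeans <- all.equal(unname(column_mean_vapply), unname(column_mean_colMeans))
```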


Section 51.2: Basic For Loop Construction 


In this example we will calculate the squared deviance for each column in a data frame, in this case mtcars.

Option A: integer index

squared_deviance <- vector("list", length(mtcars))
for (i in seq_along(mtcars)) {
  squared_deviance[[i]] <- (mtcars[[i]] - mean(mtcars[[i]]))^2
}

squared_deviance is a list of 11 elements, as expected.

class(squared_deviance)
length(squared_deviance)

Option B: character index

squared_deviance <- vector("list", length(mtcars))
squared_deviance <- setNames(squared_deviance, names(mtcars))
for (k in names(mtcars)) {
  squared_deviance[[k]] <- (mtcars[[k]] - mean(mtcars[[k]]))^2
}

What if we want a data.frame as a result? There are many options for transforming a list into other objects.
However, perhaps the simplest in this case is to store the for loop results directly in a data.frame.

squared_deviance <- mtcars     # copy the original
squared_deviance[TRUE] <- NA   # replace with NA or do squared_deviance[, ] <- NA
for (i in seq_along(mtcars)) {
  squared_deviance[[i]] <- (mtcars[[i]] - mean(mtcars[[i]]))^2
}
dim(squared_deviance)
[1] 32 11

The result will be the same even if we use the character option (B).
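
One of the other options for turning the list into a data.frame is as.data.frame(). The sketch below assumes the named list from Option B; the name squared_deviance_df is introduced here only for illustration.

```r
# Build the named list as in Option B
squared_deviance <- vector("list", length(mtcars))
squared_deviance <- setNames(squared_deviance, names(mtcars))
for (k in names(mtcars)) {
  squared_deviance[[k]] <- (mtcars[[k]] - mean(mtcars[[k]]))^2
}

# All elements have the same length, so the list converts column by column
squared_deviance_df <- as.data.frame(squared_deviance)
dim(squared_deviance_df)
```

This works because every list element has the same length; a list with elements of unequal length would make as.data.frame() fail.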


Section 51.3: The Other Looping Constructs: while and repeat 
R provides two additional looping constructs, while and repeat, which are typically used in situations where the
number of iterations required is indeterminate.

The while loop

The general form of a while loop is as follows,

while (condition) {
  ## do something
  ## in loop body
}

where condition is evaluated prior to entering the loop body. If condition evaluates to TRUE, the code inside of the
loop body is executed, and this process repeats until condition evaluates to FALSE (or a break statement is
reached; see below). Unlike the for loop, if a while loop uses a variable to perform incremental iterations, the
variable must be declared and initialized ahead of time, and must be updated within the loop body. For example,
the following loops accomplish the same task:

for (i in 0:4) {
  cat(i, "\n")
}
# 0
# 1
# 2
# 3
# 4

i <- 0
while (i < 5) {
  cat(i, "\n")
  i <- i + 1
}
# 0
# 1
# 2
# 3
# 4

In the while loop above, the line i <- i + 1 is necessary to prevent an infinite loop.
Additionally, it is possible to terminate a while loop with a call to break from inside the loop body:

iter <- 0
while (TRUE) {
  if (runif(1) < 0.25) {
    break
  } else {
    iter <- iter + 1
  }
}
iter
#[1] 4

In this example, condition is always TRUE, so the only way to terminate the loop is with a call to break inside the
body. Note that the final value of iter will depend on the state of your PRNG when this example is run, and should
produce different results (essentially) each time the code is executed.
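
If a repeatable result is wanted, the state of the PRNG can be fixed with set.seed() before the loop runs. The following sketch wraps the loop in a helper, count_until_break(), which is a hypothetical name introduced here for illustration.

```r
# Wrap the loop in a (hypothetical) helper so it can be rerun with a fixed seed
count_until_break <- function(seed) {
  set.seed(seed)        # fix the state of the random number generator
  iter <- 0
  while (TRUE) {
    if (runif(1) < 0.25) {
      break
    } else {
      iter <- iter + 1
    }
  }
  iter
}

# The same seed always yields the same number of iterations
same_result <- identical(count_until_break(42), count_until_break(42))
```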


The repeat loop 


The repeat construct is essentially the same as while (TRUE) { ## something }, and has the following form:

repeat {
  ## do something
  ## in loop body
}

The braces are only needed to group several statements, and no parentheses are required around the loop body.
Rewriting the previous example using repeat,

iter <- 0
repeat {
  if (runif(1) < 0.25) {
    break
  } else {
    iter <- iter + 1
  }
}
iter
#[1] 2


More on break 


It's important to note that break will only terminate the immediately enclosing loop. That is, the following is an
infinite loop:

while (TRUE) {
  while (TRUE) {
    cat("inner loop\n")
    break
  }
  cat("outer loop\n")
}

With a little creativity, however, it is possible to break entirely from within a nested loop. As an example, consider
the following expression, which, in its current state, will loop infinitely:

while (TRUE) {
  cat("outer loop body\n")
  while (TRUE) {
    cat("inner loop body\n")
    x <- runif(1)
    if (x < .3) {
      break
    } else {
      cat(sprintf("x is %.5f\n", x))
    }
  }
}

One possibility is to recognize that, unlike break, the return expression does have the ability to return control
across multiple levels of enclosing loops. However, since return is only valid when used within a function, we
cannot simply replace break with return() above, but also need to wrap the entire expression as an anonymous
function:

(function() {
  while (TRUE) {
    cat("outer loop body\n")
    while (TRUE) {
      cat("inner loop body\n")
      x <- runif(1)
      if (x < .3) {
        return()
      } else {
        cat(sprintf("x is %.5f\n", x))
      }
    }
  }
})()


Alternatively, we can create a dummy variable (exit) prior to the expression, and activate it via <<- from the inner
loop when we are ready to terminate:

exit <- FALSE
while (TRUE) {
  cat("outer loop body\n")
  while (TRUE) {
    cat("inner loop body\n")
    x <- runif(1)
    if (x < .3) {
      exit <<- TRUE
      break
    } else {
      cat(sprintf("x is %.5f\n", x))
    }
  }
  if (exit) break
}


Chapter 52: Column wise operation


Section 52.1: sum of each column 


Suppose we need to compute the sum of each column in a dataset:

set.seed(20)
df1 <- data.frame(ID = rep(c("A", "B", "C"), each = 3), V1 = rnorm(9), V2 = rnorm(9))
m1 <- as.matrix(df1[-1])

There are many ways to do this. Using base R, the best option would be colSums:

colSums(df1[-1], na.rm = TRUE)

Here, we removed the first column as it is non-numeric and computed the sum of each column, specifying na.rm =
TRUE (in case there are any NAs in the dataset).


This also works with a matrix:

colSums(m1, na.rm = TRUE)

This can be done in a loop with lapply/sapply/vapply:

lapply(df1[-1], sum, na.rm = TRUE)

It should be noted that the output is a list. If we need a vector output:

sapply(df1[-1], sum, na.rm = TRUE)

Or

vapply(df1[-1], sum, numeric(1), na.rm = TRUE)

For matrices, if we want to loop through columns, then use apply with MARGIN = 2:

apply(m1, 2, FUN = sum, na.rm = TRUE)
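
As a quick check that the base R options above are interchangeable, the sketch below recreates the example data and compares the results; the by_* variable names are introduced here only for illustration.

```r
set.seed(20)
df1 <- data.frame(ID = rep(c("A", "B", "C"), each = 3), V1 = rnorm(9), V2 = rnorm(9))
m1 <- as.matrix(df1[-1])

by_colSums <- colSums(df1[-1], na.rm = TRUE)
by_sapply  <- sapply(df1[-1], sum, na.rm = TRUE)
by_vapply  <- vapply(df1[-1], sum, numeric(1), na.rm = TRUE)
by_apply   <- apply(m1, 2, FUN = sum, na.rm = TRUE)   # MARGIN = 2 means columns

# Each result is a named numeric vector with one entry per column
agree_sapply <- all.equal(by_colSums, by_sapply)
agree_vapply <- all.equal(by_colSums, by_vapply)
agree_apply  <- all.equal(by_colSums, by_apply)
```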

There are ways to do this with packages like dplyr or data.table:

library(dplyr)
df1 %>%
  summarise_at(vars(matches("^V\\d+")), sum, na.rm = TRUE)

Here, we are passing a regular expression to match the names of the columns that we want to sum in
summarise_at. The regex will match all columns that start with V followed by one or more digits (\\d+).

A data.table option is

library(data.table)
setDT(df1)[, lapply(.SD, sum, na.rm = TRUE), .SDcols = 2:ncol(df1)]

We convert the 'data.frame' to a 'data.table' (setDT(df1)), specify the columns the function should be applied to in
.SDcols, loop through the Subset of Data.table (.SD) and get the sum.

If we need to use a group by operation, we can do this easily by specifying the group by column/columns:

df1 %>%
  group_by(ID) %>%
  summarise_at(vars(matches("^V\\d+")), sum, na.rm = TRUE)

In cases where we need the sum of all the columns, summarise_each can be used instead of summarise_at:

df1 %>%
  group_by(ID) %>%
  summarise_each(funs(sum(., na.rm = TRUE)))

The data.table option is

setDT(df1)[, lapply(.SD, sum, na.rm = TRUE), by = ID]


Chapter 53: JSON
Section 53.1: JSON to / from R objects 


The jsonlite package is a fast JSON parser and generator optimized for statistical data and the web. The two main
functions used to read and write JSON are fromJSON() and toJSON() respectively, and are designed to work with
vectors, matrices and data.frames, and streams of JSON from the web.

Create a JSON array from a vector, and vice versa

library(jsonlite)
## vector to JSON
toJSON(c(1,2,3))
# [1,2,3]

fromJSON('[1,2,3]')
# [1] 1 2 3

Create a named JSON array from a list, and vice versa

toJSON(list(myVec = c(1,2,3)))
# {"myVec":[1,2,3]}

fromJSON('{"myVec":[1,2,3]}')
# $myVec
# [1] 1 2 3


More complex list structures

## list structures
lst <- list(a = c(1,2,3),
            b = list(letters[1:6]))
toJSON(lst)
# {"a":[1,2,3],"b":[["a","b","c","d","e","f"]]}

fromJSON('{"a":[1,2,3],"b":[["a","b","c","d","e","f"]]}')
# $a
# [1] 1 2 3
#
# $b
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "a"  "b"  "c"  "d"  "e"  "f"


Create JSON from a data.frame, and vice versa

## converting a data.frame to JSON
df <- data.frame(id = seq_along(1:10),
                 val = letters[1:10])

toJSON(df)
# [{"id":1,"val":"a"},{"id":2,"val":"b"},{"id":3,"val":"c"},{"id":4,"val":"d"},{"id":5,"val":"e"},
#  {"id":6,"val":"f"},{"id":7,"val":"g"},{"id":8,"val":"h"},{"id":9,"val":"i"},{"id":10,"val":"j"}]

## reading a JSON string
fromJSON('[{"id":1,"val":"a"},{"id":2,"val":"b"},{"id":3,"val":"c"},{"id":4,"val":"d"},{"id":5,"val":"e"},{"id":6,"val":"f"},{"id":7,"val":"g"},{"id":8,"val":"h"},{"id":9,"val":"i"},{"id":10,"val":"j"}]')
#    id val
# 1   1   a
# 2   2   b
# 3   3   c
# 4   4   d
# 5   5   e
# 6   6   f
# 7   7   g
# 8   8   h
# 9   9   i
# 10 10   j


Read JSON direct from the internet 


## Reading JSON from URL
googleway_issues <- fromJSON("https://api.github.com/repos/SymbolixAU/googleway/issues")

googleway_issues$url
# [1] "https://api.github.com/repos/SymbolixAU/googleway/issues/20"
#     "https://api.github.com/repos/SymbolixAU/googleway/issues/19"
# [3] "https://api.github.com/repos/SymbolixAU/googleway/issues/14"
#     "https://api.github.com/repos/SymbolixAU/googleway/issues/11"
# [5] "https://api.github.com/repos/SymbolixAU/googleway/issues/9"
#     "https://api.github.com/repos/SymbolixAU/googleway/issues/5"
# [7] "https://api.github.com/repos/SymbolixAU/googleway/issues/2"


Chapter 54: RODBC


Section 54.1: Connecting to Excel Files via RODBC 


While RODBC is restricted to Windows computers with compatible architecture between R and any target RDBMS, one
of its key flexibilities is to work with Excel files as if they were SQL databases.

require(RODBC)
con = odbcConnectExcel("myfile.xlsx")            # open a connection to the Excel file
sqlTables(con)$TABLE_NAME                        # show all sheets
df = sqlFetch(con, "Sheet1")                     # read a sheet
df = sqlQuery(con, "select * from [Sheet1$]")    # read a sheet (alternative SQL syntax)
close(con)                                       # close the connection to the file


Section 54.2: SQL Server Management Database connection to get individual table

Another use of RODBC is in connecting with SQL Server Management Database. We need to specify the 'Driver', i.e.
SQL Server here, the database name "Atilla", and then use sqlQuery to extract either the full table or a fraction
of it.

library(RODBC)
cn <- odbcDriverConnect(connection = "Driver={SQL Server};server=localhost;database=Atilla;trusted_connection=yes;")
tbl <- sqlQuery(cn, 'select top 10 * from table_1')


Section 54.3: Connecting to relational databases 


library(RODBC) 

con <- odbcDriverConnect("driver={Sql Server};server=servername;trusted_connection=true")
dat <- sqlQuery(con, "select * from table")
close(con)

This will connect to a SQL Server instance. For more information on what your connection string should look like,
visit connectionstrings.com.

Also, since there's no database specified, you should make sure you fully qualify the object you want to query,
like this: databasename.schema.objectname


Chapter 55: lubridate


Section 55.1: Parsing dates and datetimes from strings with lubridate

The lubridate package provides convenient functions to parse date and datetime objects from character strings.
The function names are permutations of the following letters:

Letter             Element to parse   Base R equivalent
y                  year               %y, %Y
m (with y and d)   month              %m, %b, %h, %B
d                  day                %d, %e
h                  hour               %H, %I%p
m (with h and s)   minute             %M
s                  seconds            %S

e.g. ymd() for parsing a date with the year followed by the month followed by the day, e.g. "2016-07-22", or
ymd_hms() for parsing a datetime in the order year, month, day, hours, minutes, seconds, e.g. "2016-07-22
13:04:47".

The functions are able to recognize most separators (such as /, -, and whitespace) without additional arguments.
They also work with inconsistent separators.
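
For comparison, the base R equivalents listed in the table need an explicit format string for each layout. A minimal sketch using only base R follows; the example strings are chosen for illustration.

```r
# Base R needs an explicit format string for each layout
d1 <- as.Date("2016-07-22", format = "%Y-%m-%d")
d2 <- as.Date("07/22/2016", format = "%m/%d/%Y")

# Datetimes use the %H, %M and %S tokens from the table above
dt <- as.POSIXct("2016-07-22 13:04:47", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

same_date <- identical(d1, d2)        # both strings describe the same day
clock_time <- format(dt, "%H:%M:%S")  # extract the time of day as text
```

The lubridate functions save this bookkeeping by inferring the separators from the input.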


Dates

The date functions return an object of class Date.

library(lubridate)

mdy(c(' 07/02/2016 ', '7 / 03 / 2016', ' 7 / 4 / 16 '))
## [1] "2016-07-02" "2016-07-03" "2016-07-04"

ymd(c("20160724", "2016/07/23", "2016-07-25")) # inconsistent separators
## [1] "2016-07-24" "2016-07-23" "2016-07-25"

Datetimes

Utility functions

Datetimes can be parsed using ymd_hms variants including ymd_hm and ymd_h. All datetime functions can accept a tz
timezone argument akin to that of as.POSIXct or strptime, but which defaults to "UTC" instead of the local
timezone.

The datetime functions return an object of class POSIXct.

x <- c("20160724 130102", "2016/07/23 14:02:01", "2016-07-25 15:03:00")
ymd_hms(x, tz = "EST")
## [1] "2016-07-24 13:01:02 EST" "2016-07-23 14:02:01 EST"
## [3] "2016-07-25 15:03:00 EST"

ymd_hms(x)
## [1] "2016-07-24 13:01:02 UTC" "2016-07-23 14:02:01 UTC"
## [3] "2016-07-25 15:03:00 UTC"

Parser functions


lubridate also includes three functions for parsing datetimes with a formatting string like as.POSIXct or strptime:

Function          Output class                              Formatting strings accepted
parse_date_time   POSIXct                                   Flexible. Will accept strptime-style with % or lubridate
                                                            datetime function name style, e.g. "ymd hms". Will accept
                                                            a vector of orders for heterogeneous data and guess which
                                                            is appropriate.
parse_date_time2  Default POSIXct; if lt = TRUE, POSIXlt    Strict. Accepts only strptime tokens (with or without %)
                                                            from a limited set.
fast_strptime     Default POSIXlt; if lt = FALSE, POSIXct   Strict. Accepts only %-delimited strptime tokens with
                                                            delimiters (-, /, :, etc.) from a limited set.

x <- c('2016-07-22 13:04:47', '07/22/2016 1:04:47 pm')

parse_date_time(x, orders = c('mdy Imsp', 'ymd hms'))
## [1] "2016-07-22 13:04:47 UTC" "2016-07-22 13:04:47 UTC"

x <- c('2016-07-22 13:04:47', '2016-07-22 14:47:58')

parse_date_time2(x, orders = 'Ymd HMS')
## [1] "2016-07-22 13:04:47 UTC" "2016-07-22 14:47:58 UTC"

fast_strptime(x, format = '%Y-%m-%d %H:%M:%S')
## [1] "2016-07-22 13:04:47 UTC" "2016-07-22 14:47:58 UTC"

parse_date_time2 and fast_strptime use a fast C parser for efficiency.

See ?parse_date_time for formatting tokens.


Section 55.2: Difference between period and duration 


Unlike durations, periods can be used to accurately model clock times without knowing when events such as leap
seconds, leap days, and DST changes occur.

start_2012 <- ymd_hms("2012-01-01 12:00:00")
## [1] "2012-01-01 12:00:00 UTC"

# period() considers leap year calculations.
start_2012 + period(1, "years")
## [1] "2013-01-01 12:00:00 UTC"

# Here the duration created by dyears() doesn't consider leap year calculations.
start_2012 + dyears(1)
## [1] "2012-12-31 12:00:00 UTC"
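
The one-day gap between the two results comes from 2012 being a leap year. The base R sketch below illustrates the arithmetic, assuming (as in the output above) that a one-year duration is counted as a fixed 365 days.

```r
# Number of days between 1 January 2012 and 1 January 2013
days_2012 <- as.numeric(as.Date("2013-01-01") - as.Date("2012-01-01"))

# Adding a fixed 365 days therefore lands one day short of the next New Year
fixed_year_later <- as.Date("2012-01-01") + 365
```

A period of one year, in contrast, moves the calendar year forward and keeps the same month, day and clock time.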


Section 55.3: Instants 


An instant is a specific moment in time. Any date-time object that refers to a moment of time is recognized as an 
instant. To test if an object is an instant, use is. instant. 


library(lubridate) 


today_start <- dmy_hms("22.07.2016 12:00:00", tz = "IST") # default tz="UTC" 
today_start 

## [1] "2016-07-22 12:00:00 IST" 

is.instant(today_start) 

## [1] TRUE 
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now_dt <- ymd_hms(now(), tz="IST") 
now_dt 

## [1] "2016-07-22 13:53:09 IST" 
is.instant(now_dt) 

## [1] TRUE 


is.instant("helloworld") 
## [1] FALSE 
is.instant(60) 

## [1] FALSE 


Section 55.4: Intervals, Durations and Periods 


Intervals are the simplest way of recording timespans in lubridate. An interval is a span of time that occurs between 
two specific instants. 


# create interval by subtracting two instants 
today_start <- ymd_hms("2016-07-22 12-00-00", tz="IST") 
today_start 

## [1] "2016-07-22 12:00:00 IST" 

today_end <- ymd_hms("2016-07-22 23-59-59", tz="IST") 
today_end 

## [1] "2016-07-22 23:59:59 IST" 

span <- today_end - today_start 

span 

## Time difference of 11.99972 hours 

as.interval(span, today_start) 

## [1] 2016-07-22 12:00:00 IST--2016-07-22 23:59:59 IST 


# create interval using interval() function 


span <- interval(today_start, today_end) 
span 
## [1] 2016-07-22 12:00:00 IST--2016-07-22 23:59:59 IST 


Durations measure the exact amount of time that occurs between two instants. 


duration(60, "seconds") 
## [1] "60s" 


duration(2, "minutes") 
## [1] "120s (~2 minutes)" 


Note: Units larger than weeks are not used due to their variability. 


Durations can be created using dseconds, dminutes and other duration helper functions. Run ?quick_durations for 
the complete list. 


dseconds(60) 
## [1] "60s" 


dhours(2) 
## [1] "7200s (~2 hours)" 


dyears(1) 
## [1] "31536000s (~365 days)" 


Durations can be added to and subtracted from instants to get new instants. 


today_start + dhours(5) 
## [1] "2016-07-22 17:00:00 IST" 


today_start + dhours(5) + dminutes(30) + dseconds(15) 
## [1] "2016-07-22 17:30:15 IST" 


Durations can be created from intervals. 


as.duration(span) 
## [1] "43199s (~12 hours)" 


Periods measure the change in clock time that occurs between two instants. 


Periods can be created using the period function as well as other helper functions like seconds, hours, etc. To get a 
complete list of period helper functions, run ?quick_periods. 


period(1, "hour") 
## [1] "1H 0M 0S" 


hours(1) 
## [1] "1H 0M 0S" 


period(6, "months") 
## [1] "6m 0d 0H 0M 0S" 


months(6) 
## [1] "6m 0d 0H 0M 0S" 


years(1) 
## [1] "1y 0m 0d 0H 0M 0S" 


The is.period function can be used to check if an object is a period. 


is.period(years(1)) 
## [1] TRUE 


is.period(dyears(1) ) 
## [1] FALSE 


Section 55.5: Manipulating date and time in lubridate 


date <- now() 
date 
## "2016-07-22 03:42:35 IST" 


year(date) 
## 2016 


minute(date) 
## 42 


wday(date, label = TRUE, abbr = TRUE) 
## [1] Fri 
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat 


day(date) <- 31 
## "2016-07-31 03:42:35 IST" 


# If an element is set to a larger value than it supports, the difference 
# will roll over into the next higher element 
day(date) <- 32 
## "2016-08-01 03:42:35 IST" 


Section 55.6: Time Zones 
with_tz returns a date-time as it would appear in a different time zone. 


nyc_time <- now("America/New_York") 
nyc_time 

## [1] "2016-07-22 05:49:08 EDT" 

# corresponding Europe/Moscow time 


with_tz(nyc_time, tzone = "Europe/Moscow" ) 
## [1] "2016-07-22 12:49:08 MSK" 


force_tz returns the date-time that has the same clock time as x in the new time zone. 


nyc_time <- now("America/New_York") 
nyc_time 
## [1] "2016-07-22 05:49:08 EDT" 


force_tz(nyc_time, tzone = "Europe/Moscow") # only timezone changes 
## [1] "2016-07-22 05:49:08 MSK" 
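The difference between the two functions can be checked numerically. This is a small sketch, assuming lubridate is installed and both zone names are available on the system; the timestamp is the one printed above, fixed so that the result does not depend on now().

```r
library(lubridate)

nyc <- ymd_hms("2016-07-22 05:49:08", tz = "America/New_York")

same_instant <- with_tz(nyc, tzone = "Europe/Moscow")   # clock time changes
same_clock   <- force_tz(nyc, tzone = "Europe/Moscow")  # instant changes

# with_tz() leaves the underlying instant untouched
as.numeric(same_instant) == as.numeric(nyc)

# force_tz() moves the instant by the offset between the two zones (7 hours here)
difftime(nyc, same_clock, units = "hours")
```

with_tz is the usual choice for display; force_tz is mainly useful for repairing data that was stored with the wrong zone label.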


Section 55.7: Parsing date and time in lubridate 


Lubridate provides the ymd() series of functions for parsing character strings into dates. The letters y, m, and d 
correspond to the year, month, and day elements of a date-time. 


mdy("07-21-2016") # Returns Date 

## [1] "2016-07-21" 

mdy("07-21-2016", tz = "UTC") # Returns a vector of class POSIXt 
## "2016-07-21 UTC" 

dmy("21-07-2016") # Returns Date 

## [1] "2016-07-21" 

dmy(c("21.07.2016", "22.07.2016")) # Returns vector of class Date 

## [1] "2016-07-21" "2016-07-22" 


Section 55.8: Rounding dates 


now_dt <- ymd_hms(now(), tz="IST") 
now_dt 
## [1] "2016-07-22 13:53:09 IST" 


round_date() takes a date-time object and rounds it to the nearest integer value of the specified time unit. 


round_date(now_dt, "minute") 
## [1] "2016-07-22 13:53:00 IST" 


round_date(now_dt, "hour") 
## [1] "2016-07-22 14:00:00 IST" 


round_date(now_dt, "year") 
## [1] "2017-01-01 IST" 


floor_date() takes a date-time object and rounds it down to the nearest integer value of the specified time unit. 


floor_date(now_dt, "minute") 
## [1] "2016-07-22 13:53:00 IST" 


floor_date(now_dt, "hour") 
## [1] "2016-07-22 13:00:00 IST" 


floor_date(now_dt, "year") 
## [1] "2016-01-01 IST" 


ceiling_date() takes a date-time object and rounds it up to the nearest integer value of the specified time unit. 


ceiling_date(now_dt, "minute") 
## [1] "2016-07-22 13:54:00 IST" 


ceiling_date(now_dt, "hour" ) 
## [1] "2016-07-22 14:00:00 IST" 


ceiling_date(now_dt, "year" ) 
## [1] "2017-01-01 IST" 


Chapter 56: Time Series and Forecasting 


Section 56.1: Creating a ts object 


Time series data can be stored as a ts object. A ts object records the seasonal frequency of the series, which is used 
by ARIMA functions. It also allows elements of the series to be selected by date using the window function. 


#Create a dummy dataset of 100 observations 
x <- rnorm(100) 


#Convert this vector to a ts object with 100 annual observations 
x <- ts(x, start = c(1900), freq = 1) 


#Convert this vector to a ts object with 100 monthly observations starting in July 
x <- ts(x, start = c(1900, 7), freq = 12) 


#Alternatively, the starting observation can be a number: 
x <- ts(x, start = 1900.5, freq = 12) 


#Convert this vector to a ts object with 100 weekly observations (52 per year) starting in 
#the first week of 1900 
x <- ts(x, start = c(1900, 1), freq = 52) 


#The default plot for a ts object is a line plot 
plot(x) 


#The window function can call elements or sets of elements by date 


#Call the first 4 weeks of 1900 
window(x, start = c(1900, 1), end = c(1900, 4)) 


#Call only the 10th week in 1900 
window(x, start = c(1900, 10), end = c(1900, 10)) 


#Call all weeks including and after the 10th week of 1900 
window(x, start = c(1900, 10)) 


It is possible to create ts objects with multiple series: 


#Create a dummy matrix of 3 series with 100 observations each 
x <- cbind(rnorm(100), rnorm(100), rnorm(100)) 


#Create a multi-series ts with annual observation starting in 1900 
x <- ts(x, start = 1900, freq = 1) 


#R will draw a plot for each series in the object 
plot(x) 
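To see how start, frequency and window fit together, it can help to use a series whose values are easy to recognise. The sketch below uses only base R; the name sales and the values 1 to 36 are made up for illustration.

```r
# 36 monthly observations starting in January 2000 (values are just 1..36)
sales <- ts(1:36, start = c(2000, 1), frequency = 12)

frequency(sales)       # observations per year
start(sales)           # first time point as c(year, period)
end(sales)             # last time point as c(year, period)

# Select March to May 2001 by date
spring_2001 <- window(sales, start = c(2001, 3), end = c(2001, 5))
spring_2001
```

Because the values count the months, the selected window contains observations 15 to 17, which confirms that c(2001, 3) means the third period of the year 2001.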


Section 56.2: Exploratory Data Analysis with time-series data 


data(AirPassengers) 
class(AirPassengers) 


[1] "ts" 


In the spirit of Exploratory Data Analysis (EDA), a good first step is to look at a plot of your time-series data: 


plot(AirPassengers) # plot the raw data 
abline(reg=lm(AirPassengers~time(AirPassengers))) # fit a trend line 


[Figure: line plot of AirPassengers (y-axis, about 100 to 600) against Time (x-axis, 1950 to 1960) with the fitted trend line] 


For further EDA we examine cycles across years: 


cycle(AirPassengers) 


     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949   1   2   3   4   5   6   7   8   9  10  11  12
1950   1   2   3   4   5   6   7   8   9  10  11  12
1951   1   2   3   4   5   6   7   8   9  10  11  12
1952   1   2   3   4   5   6   7   8   9  10  11  12
1953   1   2   3   4   5   6   7   8   9  10  11  12
1954   1   2   3   4   5   6   7   8   9  10  11  12
1955   1   2   3   4   5   6   7   8   9  10  11  12
1956   1   2   3   4   5   6   7   8   9  10  11  12
1957   1   2   3   4   5   6   7   8   9  10  11  12
1958   1   2   3   4   5   6   7   8   9  10  11  12
1959   1   2   3   4   5   6   7   8   9  10  11  12
1960   1   2   3   4   5   6   7   8   9  10  11  12
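The cycle positions can also be used as a grouping variable for numeric summaries. The following is a small base R sketch; the variable names monthly_mean and yearly_total are chosen here for illustration.

```r
data(AirPassengers)

# Mean passengers per calendar month (1 = Jan, ..., 12 = Dec), across all years
monthly_mean <- tapply(AirPassengers, cycle(AirPassengers), mean)

# Total passengers per year; aggregate() collapses a ts to a lower frequency
yearly_total <- aggregate(AirPassengers, nfrequency = 1, FUN = sum)

round(monthly_mean, 1)
yearly_total
```

The monthly means summarise the seasonal pattern, while the yearly totals summarise the trend.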


boxplot(AirPassengers~cycle(AirPassengers)) #Box plot across months to explore seasonal effects 


Chapter 57: strsplit function 


Section 57.1: Introduction 


strsplit is a useful function for breaking up the elements of a character vector into a list on some character pattern. 
With typical R tools, the whole list can be reincorporated into a data.frame, or part of the list might be used in a 
graphing exercise. 


Here is a common usage of strsplit: break a character vector along a comma separator: 


temp <- c("this,that,other", "hat,scarf,food", "woman,man,child") 
# get a list split by commas 
myList <- strsplit(temp, split=",") 
# print myList 
myList 
[[1]] 
[1] "this"  "that"  "other" 

[[2]] 
[1] "hat"   "scarf" "food" 

[[3]] 
[1] "woman" "man"   "child" 


As hinted above, the split argument is not limited to characters, but may follow a pattern dictated by a regular 
expression. For example, temp2 is identical to temp above except that the separators have been altered for each 
item. We can take advantage of the fact that the split argument accepts regular expressions to alleviate the 
irregularity in the vector. 


temp2 <- c("this, that, other", "hat,scarf ,food", "woman; man ; child") 
myList2 <- strsplit(temp2, split=" ?[,;] ?") 
myList2 
[[1]] 
[1] "this"  "that"  "other" 

[[2]] 
[1] "hat"   "scarf" "food" 

[[3]] 
[1] "woman" "man"   "child" 

Notes: 


1. Breaking down the regular expression syntax is out of scope for this example. 
2. Sometimes matching regular expressions can slow down a process. As with many R functions that allow the 
use of regular expressions, the fixed argument is available to tell R to match on the split characters literally. 
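The fixed argument also matters for correctness when the separator is a regular expression metacharacter. A minimal base R sketch, using made-up version strings:

```r
versions <- c("3.2.1", "4.10.0")

# "." is a regex metacharacter matching any character, so every character
# is treated as a separator and only empty strings remain
strsplit(versions, split = ".")

# Either escape the dot or ask for a literal match with fixed = TRUE
by_regex <- strsplit(versions, split = "\\.")
by_fixed <- strsplit(versions, split = ".", fixed = TRUE)

by_fixed
identical(by_regex, by_fixed)
```

When the separator is a plain string, fixed = TRUE avoids the escaping and skips the regular expression engine.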
Chapter 58: Web scraping and parsing 


Section 58.1: Basic scraping with rvest 


rvest is a package for web scraping and parsing by Hadley Wickham inspired by Python's Beautiful Soup. It 
leverages Hadley's xml2 package's libxml2 bindings for HTML parsing. 


As part of the tidyverse, rvest is designed to be used with the pipe (%>%). It uses 


• xml2::read_html to scrape the HTML of a webpage, 
• which can then be subset with its html_node and html_nodes functions using CSS or XPath selectors, and 
• parsed to R objects with functions like html_text and html_table. 


To scrape the table of milestones from the Wikipedia page on R, the code would look like 


library(rvest) 
url <- 'https://en.wikipedia.org/wiki/R_(programming_language)' 


# scrape HTML from website 
url %>% read_html() %>% 
    # select HTML tag with class="wikitable" 
    html_node(css = '.wikitable') %>% 
    # parse table into data.frame 
    html_table() %>% 
    # trim for printing 
    dplyr::mutate(Description = substr(Description, 1, 70)) 


## Release Date Description 
## 1 0.16 This is the last alpha version developed primarily by Ihaka 
## 2 0.49 1997-04-23 This is the oldest source release which is currently availab 
## 3 0.60 1997-12-05 R becomes an official part of the GNU Project. The code is h 
## 4 0.65.1 1999-10-07 First versions of update.packages and install.packages funct 
## 5 1.0 2000-02-29 Considered by its developers stable enough for production us 
## 6 1.4 2001-12-19 S4 methods are introduced and the first version for Mac OS X 
## 7 2.0 2004-10-04 Introduced lazy loading, which enables fast loading of data 
## 8 2.1 2005-04-18 Support for UTF-8 encoding, and the beginnings of internatio 
## 9 2.11 2010-04-22 Support for Windows 64 bit systems. 
## 10 2.13 2011-04-14 Adding a new compiler function that allows speeding up funct 
## 11 2.14 2011-10-31 Added mandatory namespaces for packages. Added a new paralle 
## 12 2.15 2012-03-30 New load balancing functions. Improved serialization speed f 
## 13 3.0 2013-04-03 Support for numeric index values 231 and larger on 64 bit sy 


While this returns a data.frame, note that as is typical for scraped data, there is still further data cleaning to be 
done: here, formatting dates, inserting NAs, and so on. 


Note that data in a less consistently rectangular format may take looping or other further munging to successfully 
parse. If the website makes use of jQuery or other means to insert content, read_html may be insufficient to 
scrape, and a more robust scraper like RSelenium may be necessary. 
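The same select-then-parse steps can be tried without a network connection, since read_html also accepts a literal HTML string. This is a minimal sketch, assuming rvest is installed; the table contents and the class name releases are made up for illustration.

```r
library(rvest)

# A small inline HTML document stands in for a downloaded page (hypothetical data)
html <- '<html><body>
  <table class="releases">
    <tr><th>Release</th><th>Date</th></tr>
    <tr><td>1.0</td><td>2000-02-29</td></tr>
    <tr><td>2.0</td><td>2004-10-04</td></tr>
  </table>
</body></html>'

page <- read_html(html)

releases <- page %>%
    html_node(css = "table.releases") %>%   # select the table by tag and class
    html_table()                            # parse it into a data frame

# Scraped columns often need type conversion
releases$Date <- as.Date(releases$Date)
releases
```

Working on a saved or inline copy of the HTML is a convenient way to develop selectors before pointing the code at the live site.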


Section 58.2: Using rvest when login is required 
A common problem encountered when scraping the web is how to enter a user ID and password to log into a website. 
This example, which I created to track my answers posted on Stack Overflow, shows one approach. The overall flow 
is to log in, go to a web page, collect information, add it to a data frame, and then move to the next page. 


library(rvest) 


#Address of the login webpage 
login <- "https://stackoverflow.com/users/login?ssrc=head&returnurl=http%3a%2f%2fstackoverflow.com%2f" 


#create a web session with the desired login address 
#(in rvest >= 1.0.0 these functions are named session(), html_form_set(), 
# session_submit() and session_jump_to()) 
pgsession <- html_session(login) 
pgform <- html_form(pgsession)[[2]]  #in this case the submit is the 2nd form 
filled_form <- set_values(pgform, email="*****", password="*****") 
submit_form(pgsession, filled_form) 


#pre allocate the final results dataframe. 
results<-data.frame() 


#loop through all of the pages with the desired info 
for (i in 1:5) 
{ 
    #base address of the pages to extract information from 
    url <- "http://stackoverflow.com/users/**********?tab=answers&sort=activity&page=" 
    url <- paste0(url, i) 
    page <- jump_to(pgsession, url) 

    #collect info on the question votes and question title 
    summary <- html_nodes(page, "div .answer-summary") 
    question <- matrix(html_text(html_nodes(summary, "div"), trim=TRUE), ncol=2, byrow=TRUE) 

    #find date answered, hyperlink and whether it was accepted 
    dateans <- html_node(summary, "span") %>% html_attr("title") 
    hyperlink <- html_node(summary, "div a") %>% html_attr("href") 
    accepted <- html_node(summary, "div") %>% html_attr("class") 

    #create temp results then bind to final results 
    rtemp <- cbind(question, dateans, accepted, hyperlink) 
    results <- rbind(results, rtemp) 
} 


#Dataframe Clean-up 

names(results) <- c("Votes", "Answer", "Date", "Accepted", "HyperLink") 
results$Votes <- as.integer(as.character(results$Votes)) 
results$Accepted <- ifelse(results$Accepted=="answer-votes default", 0, 1) 


The loop in this case is limited to only 5 pages; change this to fit your application. I replaced the user-specific 
values with ******. Hopefully this provides some guidance for your problem. 


Chapter 59: Generalized linear models 


Section 59.1: Logistic regression on Titanic dataset 


Logistic regression is a particular case of the generalized linear model, used to model dichotomous outcomes (probit 
and complementary log-log models are closely related). 


The name comes from the link function used, the logit or log-odds function. The inverse function of the logit is called 
the logistic function and is given by: 


logistic(x) = 1 / (1 + exp(-x)) 


This function takes a value in ]-Inf; +Inf[ and returns a value between 0 and 1; i.e. the logistic function takes a 
linear predictor and returns a probability. 


Logistic regression can be performed using the glm function with the option family = binomial (shortcut for 
family = binomial(link="logit"); the logit being the default link function for the binomial family). 
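The relationship between the logit and the logistic function can be checked directly. The sketch below defines both by hand (the names logistic and logit are chosen here for illustration) and compares them with the base R functions plogis and qlogis.

```r
# Hand-written logistic (inverse logit) and logit functions
logistic <- function(x) 1 / (1 + exp(-x))
logit    <- function(p) log(p / (1 - p))

eta <- c(-2, 0, 2)        # linear predictor values
p   <- logistic(eta)      # probabilities strictly between 0 and 1
p

# Base R ships the same pair as plogis() and qlogis()
all.equal(p, plogis(eta))
all.equal(logit(p), qlogis(p))
all.equal(logit(p), eta)  # logit undoes logistic
```

A linear predictor of 0 corresponds to a probability of 0.5, and the mapping is symmetric around that point.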


In this example, we try to predict the fate of the passengers aboard the RMS Titanic. 


Read the data: 


url <- "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt" 
titanic <- read.csv(file = url, stringsAsFactors = FALSE) 


Clean the missing values: 


In this case, we replace the missing values by an approximation, the average. 


titanic$age[is.na(titanic$age)] <- mean(titanic$age, na.rm = TRUE) 


Train the model: 


titanic.train <- glm(survived ~ pclass + sex + age, 
family = binomial, data = titanic) 


Summary of the model: 
summary(titanic.train) 


The output: 


Call: 
glm(formula = survived ~ pclass + sex + age, family = binomial, data = titanic) 

Deviance Residuals: 
    Min       1Q   Median       3Q      Max 
-2.6452  -0.6641  -0.3679   0.6123   2.5615 

Coefficients: 
             Estimate Std. Error z value Pr(>|z|) 
(Intercept)  3.552261   0.342188  10.381  < 2e-16 *** 
pclass2nd   -1.170777   0.211559  -5.534 3.13e-08 *** 
pclass3rd   -2.430672   0.195157 -12.455  < 2e-16 *** 
sexmale     -2.463377   0.154587 -15.935  < 2e-16 *** 
age         -0.042235   0.007415  -5.696 1.23e-08 *** 
--- 
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1) 

    Null deviance: 1686.8  on 1312  degrees of freedom 
Residual deviance: 1165.7  on 1308  degrees of freedom 
AIC: 1175.7 

Number of Fisher Scoring iterations: 5 


• The first thing displayed is the call. It is a reminder of the model and the options specified. 

• Next we see the deviance residuals, which are a measure of model fit. This part of the output shows the 
  distribution of the deviance residuals for individual cases used in the model. 

• The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a 
  Wald z-statistic), and the associated p-values. 

    ◦ The qualitative variables are "dummified". One modality is taken as the reference. The reference 
      modality can be changed, e.g. by applying relevel to the factor before fitting. 
    ◦ All four predictors are statistically significant at the 0.1 % level. 
    ◦ The logistic regression coefficients give the change in the log odds of the outcome for a one unit 
      increase in the predictor variable. 
    ◦ To see the odds ratio (multiplicative change in the odds of survival per unit increase in a predictor 
      variable), exponentiate the parameter. 
    ◦ To see the confidence interval (CI) of the parameter, use confint. 

• Below the table of coefficients are fit indices, including the null and residual deviances and the Akaike 
  Information Criterion (AIC), which can be used for comparing model performance. 

    ◦ When comparing models fitted by maximum likelihood to the same data, the smaller the AIC, the 
      better the fit. 
    ◦ One measure of model fit is the significance of the overall model. This test asks whether the model 
      with predictors fits significantly better than a model with just an intercept (i.e., a null model). 


Example of odds ratios: 


exp(coef(titanic.train)[3]) 


 pclass3rd 
0.08797765 


With this model, compared to the first class, the 3rd class passengers have about a tenth of the odds of survival. 


Example of confidence interval for the parameters: 


confint(titanic.train) 


Waiting for profiling to be done... 

                  2.5 %      97.5 % 
(Intercept)  2.89486872  4.23734280 
pclass2nd   -1.58986065 -0.75987230 
pclass3rd   -2.81987935 -2.05419580 
sexmale     -2.77[...]  -2.16528316 
age         -0.05695894 -0.02786211 


Example of calculating the significance of the overall model: 


The test statistic is distributed chi-squared with degrees of freedom equal to the differences in degrees of freedom 
between the current and the null model (i.e., the number of predictor variables in the model). 


with(titanic.train, pchisq(null.deviance - deviance, df.null - df.residual 
, lower.tail = FALSE)) 
[1] 1.892539e-111 


The p-value is near 0, showing a strongly significant model. 
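Since the Titanic file has to be downloaded, the same steps can be rehearsed on a dataset that ships with R. This is a sketch under stated assumptions: mtcars stands in for the Titanic data, the binary outcome is am (manual gearbox) and the single predictor is wt; confint.default gives Wald intervals, which differ slightly from the profile-likelihood intervals shown above.

```r
# mtcars is used only because it ships with R; am (0/1) is modelled from weight
fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Odds ratios with Wald 95% confidence intervals (confint.default uses the
# normal approximation rather than the profile likelihood)
odds_ratios <- exp(cbind(OR = coef(fit), confint.default(fit)))
odds_ratios

# Likelihood-ratio test of the whole model against the intercept-only model
lr_stat <- fit$null.deviance - fit$deviance
lr_df   <- fit$df.null - fit$df.residual
p_value <- pchisq(lr_stat, lr_df, lower.tail = FALSE)
p_value

# Predicted probability of a manual gearbox for a car weighing 3000 lbs
predict(fit, newdata = data.frame(wt = 3), type = "response")
```

Exponentiating the interval endpoints gives a confidence interval on the odds-ratio scale, which is usually easier to report than the log-odds scale.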
Chapter 60: Reshaping data between long 
and wide forms 


In R, tabular data is stored in data frames. This topic covers the various ways of transforming a single table. 


Section 60.1: Reshaping data 


Often data comes in tables. Generally one can divide this tabular data into wide and long formats. In a wide format, 
each variable has its own column. 


Person   Height [cm]   Age [yr] 
Alison   178           20 
Bob      174           45 
Carl     182           31 


However, sometimes it is more convenient to have a long format, in which all variables are in one column and the 
values are in a second column. 


Person   Variable      Value 
Alison   Height [cm]   178 
Bob      Height [cm]   174 
Carl     Height [cm]   182 
Alison   Age [yr]      20 
Bob      Age [yr]      45 
Carl     Age [yr]      31 


Base R, as well as third party packages, can be used to simplify this process. For each of the options, the mtcars 
dataset will be used. By default, this dataset is in a wide format. In order for the packages to work, we will insert the 
row names as the first column. 


mtcars # shows the dataset 
data <- data.frame(observation=row.names(mtcars), mtcars) 


Base R 


There are two functions in base R that can be used to convert between wide and long format: stack() and 
unstack(). 


long <- stack(data) 
long # this shows the long format 


wide <- unstack(long) 
wide # this shows the wide format 


However, these functions can become very complex for more advanced use cases. Luckily, there are other options 
using third party packages. 


The tidyr package 
This package uses gather() to convert from wide to long and spread() to convert from long to wide. 


library(tidyr) 
long <- gather(data, variable, value, 2:12) # where variable is the name of the 
# variable column, value indicates the name of the value column and 2:12 refers to 
# the columns to be converted. 
long # shows the long result 
wide <- spread(long, variable, value) 
wide # shows the wide result (~data) 


The data.table package 


The data.table package extends the reshape2 functions and uses the function melt() to go from wide to long and 
dcast() to go from long to wide. 


library(data.table) 
# melt() from data.table expects a data.table, so convert the data.frame first 
long <- melt(as.data.table(data), 'observation', 2:12, 'variable', 'value') 
long # shows the long result 
wide <- dcast(long, observation ~ variable) 
wide # shows the wide result (~data) 


Section 60.2: The reshape function 


The most flexible base R function for reshaping data is reshape. See ?reshape for its syntax. 


# create unbalanced longitudinal (panel) data set 

set.seed(1234) 

df <- data.frame(identifier=rep(1:5, each=3), 
location=rep(c("up", "down", "left", "up", "center"), each=3), 
period=rep(1:3, 5), counts=sample(35, 15, replace=TRUE), 
values=runif(15, 5, 10))[-c(4,8,11), ] 

df 


   identifier location period counts   values 
1           1       up      1      4 9.186478 
2           1       up      2     22 6.431116 
3           1       up      3     22 6.334104 
5           2     down      2     31 6.161130 
6           2     down      3     23 6.583062 
7           3     left      1      1 6.513467 
9           3     left      3     24 5.199980 
10          4       up      1     18 6.093998 
12          4       up      3     20 7.628488 
13          5   center      1     10 9.573291 
14          5   center      2     33 9.156725 
15          5   center      3     11 5.228851 


Note that the data.frame is unbalanced, that is, unit 2 is missing an observation in the first period, while units 3 and 
4 are missing observations in the second period. Also, note that there are two variables that vary over the periods: 
counts and values, and two that do not vary: identifier and location. 


Long to Wide 


To reshape the data.frame to wide format, 


# reshape wide on time variable 
df.wide <- reshape(df, idvar="identifier", timevar="period", 
                   v.names=c("values", "counts"), direction="wide") 
df.wide 
   identifier location values.1 counts.1 values.2 counts.2 values.3 counts.3 
1           1       up 9.186478        4 6.431116       22 6.334104       22 
5           2     down       NA       NA 6.161130       31 6.583062       23 
7           3     left 6.513467        1       NA       NA 5.199980       24 
10          4       up 6.093998       18       NA       NA 7.628488       20 
13          5   center 9.573291       10 9.156725       33 5.228851       11 


Notice that the missing time periods are filled in with NAs. 


In reshaping wide, the "v.names" argument specifies the columns that vary over time. If the location variable is not 
necessary, it can be dropped prior to reshaping with the "drop" argument. In dropping the only non-varying / non- 
id column from the data.frame, the v.names argument becomes unnecessary. 


reshape(df, idvar="identifier", timevar="period", direction="wide", 
drop="location" ) 


Wide to Long 


To reshape long with the current df.wide, a minimal syntax is 
reshape(df.wide, direction="long") 


However, this is typically trickier: 


# remove separator in df.wide names for counts and values 
names(df.wide)[grep("\\.", names(df.wide))] <- 
gsub("\\.", "", names(df.wide)[grep("\\.", names(df.wide))]) 


Now the simple syntax will produce an error about undefined columns. 


With column names that are more difficult for the reshape function to automatically parse, it is sometimes 
necessary to add the "varying" argument which tells reshape to group particular variables in wide format for the 
transformation into long format. This argument takes a list of vectors of variable names or indices. 


reshape(df.wide, idvar="identifier", 
varying=list(c(3,5,7), c(4,6,8)), direction="long") 


In reshaping long, the "v.names" argument can be provided to rename the resulting varying variables. 


Sometimes the specification of "varying" can be avoided by use of the "sep" argument which tells reshape what part 
of the variable name specifies the value argument and which specifies the time argument. 
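As an illustration of the "sep" argument, the sketch below reshapes a small made-up data frame whose wide column names have no separator between the variable name and the time (score1, score2). With sep = "", reshape splits each name just before the first numeral that follows a letter.

```r
# Hypothetical wide data: one row per id, one column per time point
wide <- data.frame(id = 1:3,
                   score1 = c(10, 20, 30),
                   score2 = c(11, 21, 31))

# sep = "" lets reshape() guess v.names ("score") and times (1, 2) from the names
long <- reshape(wide, idvar = "id",
                varying = c("score1", "score2"),
                sep = "", direction = "long")
long
```

The result has one row per id and time, with the columns id, time and score.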
Chapter 61: RMarkdown and knitr 
presentation 


Parameter   Definition 
title       The title of the document 
author      The author of the document 
date        The date of the document. Can be "`r format(Sys.time(), '%d %B, %Y')`" 
output      The output format of the document: at least 10 formats are available. For an HTML document, 
            html_document. For a PDF document, pdf_document, ... 


Section 61.1: Adding a footer to an ioslides presentation 


Adding a footer is not natively possible. Luckily, we can make use of jQuery and CSS to add a footer to the slides of 
an ioslides presentation rendered with knitr. First of all we have to include the jQuery library. This is done by the 
line 


<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.2/jquery.min.js"></script> 


Now we can use jQuery to alter the DOM (document object model) of our presentation. In other words: we alter the 
HTML structure of the document. As soon as the presentation is loaded ($(document).ready(function() { ... })), 
we select all slides that do not have the class attributes .title-slide, .backdrop, or .segue and add the tag 
<footer></footer> right before each slide is 'closed' (so before </slide>). The attribute label carries the content 
that will be displayed later on. 


All we have to do now is to lay out our footer with CSS: 
After each <footer> (footer::after): 


• display the content of the attribute label 
• use font size 12 
• position the footer (20 pixels from the bottom of the slide and 60 pixels from the left) 


(the other properties can be ignored but might have to be modified if the presentation uses a different style 
template). 


---
title: "Adding a footer to presentation slides"
author: "Martin Schmelzer"
date: "26 Juli 2016"
output: ioslides_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

## Slide 1

This is slide 1.

## Slide 2

This is slide 2

# Test

## Slide 3

And slide 3.


The result will look like this: 


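[Figure: rendered slide titled "Slide 1" with the text "This is slide 1.", the footer "My amazing footer" at the bottom left and the slide counter 2/5] 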


Section 61.2: RStudio example 


This is a script saved as .Rmd, in contrast to R scripts, which are saved as .R. 


To knit the script, either use the render function or use the shortcut button in RStudio. 


---
title: "Rstudio exemple of a rmd file"
author: 'stack user'
date: "22 July 2016"
output: html_document
---

The header is used to define the general parameters and the metadata.

## R Markdown

This is an R Markdown document.

It is a script written in markdown with the possibility to insert chunk of R code in it.
To insert R code, it needs to be encapsulated into inverted quote.

Like that for a long piece of code:

```{r cars}
summary(cars)
```

And like `r cat("that")` for small piece of code.

## Including Plots

You can also embed plots, for example:

```{r echo=FALSE}
plot(pressure)
```


Chapter 62: Scope of variables 


Section 62.1: Environments and Functions 
Variables declared inside a function only exist (unless passed) inside that function. 
x <- 1 

foo <- function(x) { 
    y <- 3 
    z <- x + y 
    return(z) 
} 

y 

Error: object 'y' not found 


Variables passed into a function and then reassigned are overwritten, but only inside the function. 


foo <- function(x) { 
    x <- 2 
    y <- 3 
    z <- x + y 
    return(z) 
} 

foo(1) 
5 

x 
1 


Variables assigned in a higher environment than a function exist within that function, without being passed. 


foo <- function() { 
    y <- 3 
    z <- x + y 
    return(z) 
} 

foo() 

4 


Section 62.2: Function Exit 


The on.exit() function is handy for variable clean up if global variables must be assigned. 


Some parameters, especially those for graphics, can only be set globally. This small function is common when 
creating more specialized plots. 


new_plot <- function(...) { 
    # par() returns the previous settings, which are restored when the function exits 
    old_pars <- par(mar = c(5,4,4,2) + .1, mfrow = c(1,1)) 
    on.exit(par(old_pars)) 
    plot(...) 
} 
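The same pattern works for any global setting that returns its previous value when changed. A minimal sketch with options(); the helper name print_rounded is hypothetical.

```r
# Hypothetical helper: print a value with a temporary digits setting
print_rounded <- function(x, digits = 3) {
    old_opts <- options(digits = digits)  # options() returns the previous values
    on.exit(options(old_opts))            # restored even if print() fails
    print(x)
    invisible(x)
}

digits_before <- getOption("digits")
print_rounded(pi)
getOption("digits") == digits_before      # the global option is unchanged
```

Registering the clean-up with on.exit() right after changing the setting keeps the two steps together and guarantees the restore also runs when an error occurs.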


Section 62.3: Sub functions 


Functions called within a function (i.e. subfunctions) must be defined within that function to access any variables 
defined in the local environment without being passed. 


This fails: 


bar <- function() { 
    z <- x + y 
    return(z) 
} 

foo <- function() { 
    y <- 3 
    z <- bar() 
    return(z) 
} 

foo() 

Error in bar() : object 'y' not found 


This works: 


foo <- function() { 
    bar <- function() { 
        z <- x + y 
        return(z) 
    } 
    y <- 3 
    z <- bar() 
    return(z) 
} 

foo() 

4 


Section 62.4: Global Assignment 
Variables can be assigned globally from any environment using <<-. bar() can now access y. 


bar <- function() { 
    z <- x + y 
    return(z) 
} 

foo <- function() { 
    y <<- 3 
    z <- bar() 
    return(z) 
} 

foo() 

4 


Global assignment is highly discouraged. Use of a wrapper function or explicitly calling variables from another local 
environment is greatly preferred. 
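Two common alternatives can be sketched as follows: passing the values a helper needs as arguments, and keeping state in a closure. The names make_counter and counter are hypothetical and only serve the illustration.

```r
# Preferred: pass what the helper needs as arguments
bar <- function(x, y) x + y
foo <- function(x) {
    y <- 3
    bar(x, y)
}
foo(1)

# A closure keeps state without touching the global environment
make_counter <- function() {
    count <- 0
    function() {
        count <<- count + 1   # modifies count in the enclosing function environment
        count
    }
}
counter <- make_counter()
counter()
counter()
```

In the closure, <<- finds count in the environment of make_counter, so nothing is created in the global environment.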


Section 62.5: Explicit Assignment of Environments and 
Variables 


Environments in R can be explicitly created and named. Variables can be explicitly assigned to and retrieved from 
those environments. 


A commonly created environment is one which encloses package:base or a subenvironment within package:base. 


e1 <- new.env(parent = baseenv()) 
e2 <- new.env(parent = e1) 


Use assign and get to write and read variables in a specific environment. 
assign("a", 3, envir = e1) 


get("a", envir = e1) 
get("a", envir = e2) 


Since e2 inherits from e1, a is 3 in both e1 and e2. However, assigning a within e2 does not change the value of a in 
e1. 


assign("a", 2, envir = e2) 


get("a", envir = e2) 
get("a", envir = e1) 
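Whether a name is defined in an environment itself or only found through its parent can be checked with exists() and ls(). A short base R sketch that rebuilds the two environments so it runs on its own:

```r
e1 <- new.env(parent = baseenv())
e2 <- new.env(parent = e1)
assign("a", 3, envir = e1)

# exists() searches parent environments unless inherits = FALSE
exists("a", envir = e2)                     # found via the parent e1
exists("a", envir = e2, inherits = FALSE)   # nothing is defined in e2 itself

# evalq() evaluates an expression inside a given environment
evalq(a * 2, envir = e2)

ls(e1)   # names defined directly in e1
ls(e2)   # empty: e2 only inherits a
```

get() has the same inherits argument, which is useful when a lookup must not fall through to a parent environment.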
Chapter 63: Performing a Permutation 
Test 


Section 63.1: A fairly general function 


We will use the built-in ToothGrowth dataset. We are interested in whether there is a statistically significant 
difference in tooth growth when the guinea pigs are given vitamin C vs orange juice. 


Here's the full example: 


teethVC = ToothGrowth[ToothGrowth$supp == 'VC', ] 
teethOJ = ToothGrowth[ToothGrowth$supp == 'OJ', ] 


permutationTest = function(vectorA, vectorB, testStat) { 
  N = 10^5 
  fullSet = c(vectorA, vectorB) 
  lengthA = length(vectorA) 
  lengthB = length(vectorB) 
  trials <- replicate(N, 
        {index <- sample(lengthB + lengthA, size = lengthA, replace = FALSE) 
        testStat((fullSet[index]), fullSet[-index]) } ) 
  trials 
} 
vec1 = teethVC$len 
vec2 = teethOJ$len 
subtractMeans = function(a, b){ return (mean(a) - mean(b)) } 
result = permutationTest(vec1, vec2, subtractMeans) 
observedMeanDifference = subtractMeans(vec1, vec2) 
result = c(result, observedMeanDifference) 
hist(result) 
abline(v=observedMeanDifference, col = "blue") 
pValue = 2*mean(result <= (observedMeanDifference)) 
pValue 


After subsetting the data, we define the function 


permutationTest = function(vectorA, vectorB, testStat) { 
  N = 10^5 
  fullSet = c(vectorA, vectorB) 
  lengthA = length(vectorA) 
  lengthB = length(vectorB) 
  trials <- replicate(N, 
        {index <- sample(lengthB + lengthA, size = lengthA, replace = FALSE) 
        testStat((fullSet[index]), fullSet[-index]) } ) 
  trials 
} 


This function takes two vectors, shuffles their contents together, and then applies the function testStat to the two
shuffled groups. The result of testStat is stored in trials, which is the return value.


It does this N = 10^5 times. Note that the value N could very well have been a parameter to the function.
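

As a sketch of that last remark, the number of shuffles can be exposed as an argument. The name permutationTestN, the default of 1000 and the seed are assumptions made for this illustration only; the rest follows the function above.

```r
# Same idea as permutationTest, but the number of shuffles N is an argument.
# permutationTestN and its default N = 1000 are hypothetical choices.
permutationTestN <- function(vectorA, vectorB, testStat, N = 1000) {
  fullSet <- c(vectorA, vectorB)
  lengthA <- length(vectorA)
  replicate(N, {
    # pick which pooled observations play the role of group A in this shuffle
    index <- sample(length(fullSet), size = lengthA, replace = FALSE)
    testStat(fullSet[index], fullSet[-index])
  })
}

set.seed(1)  # arbitrary seed, only so that a rerun gives the same shuffles
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
quickResult <- permutationTestN(vc, oj, function(a, b) mean(a) - mean(b), N = 500)
length(quickResult)
```

A small N is handy while developing; a larger N gives a smoother estimate of the null distribution at the cost of run time.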


This leaves us with a new set of data, trials: the set of mean differences that might result if there truly is no
relationship between the two variables.


Now to define our test statistic:

subtractMeans = function(a, b){ return (mean(a) - mean(b)) }

Perform the test:

result = permutationTest(vec1, vec2, subtractMeans)

Calculate our actual observed mean difference:

observedMeanDifference = subtractMeans(vec1, vec2)

Let's see what our observation looks like on a histogram of our test statistic. 


hist(result) 
abline(v=observedMeanDifference, col = "blue" ) 


[Figure: histogram of result (x-axis: result, y-axis: Frequency) with a blue vertical line at the observed mean
difference]


It doesn't look like our observed result is very likely to occur by random chance...


We want to calculate the p-value, the likelihood of a result at least as extreme as the one originally observed if
there is no relationship between the two variables.


pValue = 2*mean(result <= (observedMeanDifference))

Let's break that down a bit:

result <= (observedMeanDifference)

will create a boolean vector, like:

FALSE TRUE FALSE FALSE TRUE FALSE ...


with TRUE every time the value of result is less than or equal to observedMeanDifference. (The observed difference
is negative here, so "at least as extreme" means at least as far below zero.)


The function mean will interpret this vector as 1 for TRUE and 0 for FALSE, and give us the proportion of 1's in the
mix, i.e. the fraction of shuffles in which the mean difference was at or below what we observed.
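

The coercion of a logical vector to 0 and 1 is easy to see on a toy vector. This is a small sketch; the vector flags is made up for the illustration.

```r
flags <- c(FALSE, TRUE, FALSE, FALSE, TRUE)

sum(flags)    # number of TRUE values: 2
mean(flags)   # proportion of TRUE values: 0.4
```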


Finally, we multiply by 2 because the distribution of our test statistic is highly symmetric, and we really want to 
know which results are "more extreme" than our observed result. 


All that's left is to output the p-value, which turns out to be 0.06093939. Interpretation of this value is subjective,
but I would say that it gives only borderline evidence of a difference. Note that the observed difference (vitamin C
minus orange juice) is negative, so in this sample it is orange juice, not vitamin C, that goes with the larger tooth
growth.
Chapter 64: xgboost


Section 64.1: Cross Validation and Tuning with xgboost 


library(caret) # for dummyVars 
library(RCurl) # download https data 
library(Metrics) # calculate errors 
library(xgboost) # model 


################################################################

# Load data from UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html)
urlfile <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
x <- getURL(urlfile, ssl.verifypeer = FALSE)
adults <- read.csv(textConnection(x), header=F)


# adults <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=F)
names(adults)=c('age','workclass','fnlwgt','education','educationNum',
                'maritalStatus','occupation','relationship','race',
                'sex','capitalGain','capitalLoss','hoursWeek',
                'nativeCountry','income')
# clean up data
adults$income <- ifelse(adults$income==' <=50K',0,1)
# binarize all factors 

library(caret) 

dmy <- dummyVars(" ~ .", data = adults) 


adultsTrsf <- data.frame(predict(dmy, newdata = adults)) 
################################################################


# what we're trying to predict adults that make more than 50k 
outcomeName <- c('income' ) 

# list of features 

predictors <- names(adultsTrsf)[!names(adultsTrsf) %in% outcomeName ] 


# play around with settings of xgboost - eXtreme Gradient Boosting (Tree) library 
# https://github.com/tqchen/xgboost/wiki/Parameters 

# max.depth - maximum depth of the tree 

# nrounds - the max number of iterations 


# take first 10% of the data only! 
trainPortion <- floor(nrow(adultsTrsf)*0.1)


trainSet <- adultsTrsf[ 1:floor(trainPortion/2), ] 
testSet <- adultsTrsf[(floor(trainPortion/2)+1) :trainPortion, ] 


smallestError <- 100
for (depth in seq(1,10,1)) {
  for (rounds in seq(1,20,1)) {

    # train
    bst <- xgboost(data = as.matrix(trainSet[,predictors]),
                   label = trainSet[,outcomeName],
                   max.depth=depth, nround=rounds,
                   objective = "reg:linear", verbose=0)
    gc()

    # predict
    predictions <- predict(bst, as.matrix(testSet[,predictors]), outputmargin=TRUE)
    err <- rmse(as.numeric(testSet[,outcomeName]), as.numeric(predictions))

    # keep track of the best combination seen so far
    if (err < smallestError) {
      smallestError = err
      print(paste(depth, rounds, err))
    }
  }
}


cv <- 30
trainSet <- adultsTrsf[1:trainPortion, ]
cvDivider <- floor(nrow(trainSet) / (cv+1))


smallestError <- 100
for (depth in seq(1,10,1)) {
  for (rounds in seq(1,20,1)) {
    totalError <- c()
    indexCount <- 1
    for (cv in seq(1:cv)) {
      # assign chunk to data test
      dataTestIndex <- c((cv * cvDivider):(cv * cvDivider + cvDivider))
      dataTest <- trainSet[dataTestIndex, ]
      # everything else to train
      dataTrain <- trainSet[-dataTestIndex, ]

      bst <- xgboost(data = as.matrix(dataTrain[,predictors]),
                     label = dataTrain[,outcomeName],
                     max.depth=depth, nround=rounds,
                     objective = "reg:linear", verbose=0)
      gc()
      predictions <- predict(bst, as.matrix(dataTest[,predictors]), outputmargin=TRUE)

      err <- rmse(as.numeric(dataTest[,outcomeName]), as.numeric(predictions))
      totalError <- c(totalError, err)
    }
    # compare the error averaged over all chunks
    if (mean(totalError) < smallestError) {
      smallestError = mean(totalError)
      print(paste(depth, rounds, smallestError))
    }
  }
}


################################################################
# Test both models out on full data set 


trainSet <- adultsTrsf[ 1:trainPortion, ] 


# assign everything else to test 
testSet <- adultsTrsf[(trainPortion+1) :nrow(adultsTrsf), ] 


bst <- xgboost(data = as.matrix(trainSet[,predictors]),
               label = trainSet[,outcomeName],
               max.depth=4, nround=19, objective = "reg:linear", verbose=0)
pred <- predict(bst, as.matrix(testSet[,predictors]), outputmargin=TRUE)
rmse(as.numeric(testSet[,outcomeName]), as.numeric(pred))


bst <- xgboost(data = as.matrix(trainSet[,predictors]),
               label = trainSet[,outcomeName],
               max.depth=3, nround=20, objective = "reg:linear", verbose=0)
pred <- predict(bst, as.matrix(testSet[,predictors]), outputmargin=TRUE)
rmse(as.numeric(testSet[,outcomeName]), as.numeric(pred))
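

The cross-validation loop in this section cuts the training rows into chunks by index arithmetic. One possible alternative, shown here only as a base R sketch, is to label every row with a fold number and split on that label. The sizes n and k and the names foldId and folds are hypothetical, and no model is fitted in this sketch.

```r
# Hypothetical sizes, chosen only for illustration
n <- 103   # number of rows in a training set
k <- 5     # number of folds

# Label every row with a fold number, then collect the row indices per fold
foldId <- rep(seq_len(k), length.out = n)
folds  <- split(seq_len(n), foldId)

for (i in seq_len(k)) {
  testIndex  <- folds[[i]]
  trainIndex <- setdiff(seq_len(n), testIndex)
  # a model would be fitted on trainIndex and scored on testIndex here
  stopifnot(length(intersect(trainIndex, testIndex)) == 0)
}

lengths(folds)   # fold sizes differ by at most one row
```

With this layout every row is used for testing exactly once, which the index arithmetic does not guarantee when the number of rows is not a multiple of the chunk size.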
Chapter 65: R code vectorization best
practices 


Section 65.1: By row operations 


The key in vectorizing R code is to reduce or eliminate "by row operations" or method dispatching of R functions.


That means that when approaching a problem that at first glance requires "by row operations", such as calculating 
the means of each row, one needs to ask themselves: 


• What are the classes of the data sets I'm dealing with?
• Is there existing compiled code that can achieve this without the need for repetitive evaluation of R
  functions?
• If not, can I do these operations by column instead of by row?
• Finally, is it worth spending a lot of time on developing complicated vectorized code instead of just running a
  simple apply loop? In other words, is the data big/sophisticated enough that R can't handle it efficiently using
  a simple loop?


Putting aside the memory pre-allocation issue and growing objects in loops, we will focus in this example on how to
possibly avoid apply loops, method dispatching or re-evaluating R functions within loops.


A standard/easy way of calculating mean by row would be: 


apply(mtcars, 1, mean) 


          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
           29.90727            29.98136            23.59818            38.73955
  Hornet Sportabout             Valiant          Duster 360           Merc 240D
           53.66455            35.04909            59.72000            24.63455
           Merc 230            Merc 280           Merc 280C          Merc 450SE
           27.23364            31.86000            31.78727            46.43091
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
           46.50000            46.35000            66.23273            66.05855
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
           65.97227            19.44091            17.74227            18.81409
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
           24.88864            47.24091            46.00773            58.75273
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
           57.37955            18.92864            24.77909            24.88027
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
           60.97182            34.50818            63.15545            26.26273


But can we do better? Let's see what happened here:


1. First, we converted a data.frame to a matrix. (Note that this happens within the apply function.) This is both
   inefficient and dangerous. A matrix can't hold several column types at a time. Hence, such a conversion will
   probably lead to loss of information and sometimes to misleading results (compare apply(iris, 2, class)
   with str(iris) or with sapply(iris, class)).

2. Second of all, we performed an operation repetitively, one time for each row. Meaning, we had to evaluate
   some R function nrow(mtcars) times. In this specific case, mean is not a computationally expensive function,
   hence R could likely easily handle it even for a big data set, but what would happen if we need to calculate
   the standard deviation by row (which involves an expensive square root operation)? Which brings us to the
   next point:




3. We evaluated the R function many times, but maybe there already is a compiled version of this operation? 


Indeed we could simply do: 


rowMeans(mtcars) 
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
           29.90727            29.98136            23.59818            38.73955
  Hornet Sportabout             Valiant          Duster 360           Merc 240D
           53.66455            35.04909            59.72000            24.63455
           Merc 230            Merc 280           Merc 280C          Merc 450SE
           27.23364            31.86000            31.78727            46.43091
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
           46.50000            46.35000            66.23273            66.05855
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
           65.97227            19.44091            17.74227            18.81409
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
           24.88864            47.24091            46.00773            58.75273
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
           57.37955            18.92864            24.77909            24.88027
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
           60.97182            34.50818            63.15545            26.26273


This involves no by row operations and therefore no repetitive evaluation of R functions. However, we still
converted a data.frame to a matrix. Though rowMeans has an error handling mechanism and won't run on a data
set that it can't handle, the conversion still has an efficiency cost.


rowMeans(iris)
Error in rowMeans(iris) : 'x' must be numeric


But still, can we do better? Instead of a matrix conversion with error handling, we could try a different method that
will allow us to use mtcars as a vector (because a data.frame is essentially a list and a list is a vector).


Reduce(`+`, mtcars)/ncol(mtcars)

[1] 29.90727 29.98136 23.59818 38.73955 53.66455 35.04909 59.72000 24.63455 27.23364 31.86000
31.78727 46.43091 46.50000 46.35000 66.23273 66.05855

[17] 65.97227 19.44091 17.74227 18.81409 24.88864 47.24091 46.00773 58.75273 57.37955 18.92864
24.77909 24.88027 60.97182 34.50818 63.15545 26.26273


Now, for a possible speed gain, we lost the row names and error handling (including NA handling).
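

The trade-off can be made concrete with a short sketch. The variable names below and the use of setNames() to restore the names are additions for illustration; they are not part of the original example.

```r
byRowMeans <- rowMeans(mtcars)
viaReduce  <- Reduce(`+`, mtcars) / ncol(mtcars)

# same numbers, but the Reduce() result carries no names
sameValues <- isTRUE(all.equal(unname(byRowMeans), viaReduce))
names(viaReduce)   # NULL

# the names can be put back by hand
viaReduceNamed <- setNames(viaReduce, rownames(mtcars))

# a single NA poisons the Reduce() result for that row; rowMeans() can skip it
withNA <- mtcars
withNA[1, "mpg"] <- NA
naInReduce   <- is.na((Reduce(`+`, withNA) / ncol(withNA))[1])
okInRowMeans <- !is.na(rowMeans(withNA, na.rm = TRUE)[1])
```

Whether the speed gain is worth losing the names and the na.rm option depends on the data; for a data set as small as mtcars it rarely is.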


Another example would be calculating the mean by group. Using base R we could try:


aggregate(. ~ cyl, mtcars, mean) 

  cyl      mpg     disp        hp     drat       wt     qsec        vs        am     gear     carb
1   4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
2   6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
3   8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000


Still, we are basically evaluating an R function in a loop, but the loop is now hidden in an internal C function (it 
matters little whether it is a C or an R loop). 


Could we avoid it? Well there is a compiled function in R called rowsum, hence we could do: 


rowsum(mtcars[-2], mtcars$cyl)/table(mtcars$cyl)
       mpg     disp        hp     drat       wt     qsec        vs        am     gear     carb
4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000


Though we had to convert to a matrix first too. 
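

As a quick check, the rowsum() approach can be compared with aggregate(). In this sketch the as.matrix() and as.vector() calls are additions that make the conversions explicit; the variable names are made up.

```r
groupSums  <- rowsum(as.matrix(mtcars[-2]), mtcars$cyl)  # one row per cyl value
groupSizes <- as.vector(table(mtcars$cyl))               # 11, 7 and 14 cars
groupMeans <- groupSums / groupSizes                     # recycled down each column

viaAggregate <- aggregate(. ~ cyl, mtcars, mean)
agreement <- all.equal(unname(groupMeans), unname(as.matrix(viaAggregate[-1])))
agreement
```

Both routes order the groups by the sorted values of cyl, which is why the rows line up without any extra matching.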


At this point we may question whether our current data structure is the most appropriate one. Is a data.frame the
best practice? Or should one just switch to a matrix data structure in order to gain efficiency?


By row operations will get more and more expensive (even in matrices) as we start to evaluate expensive functions
each time. Let us consider a variance calculation by row example.


Let's say we have a matrix m:


set.seed(100)
m <- matrix(sample(1e2), 10)
m
## (printed output: a 10 x 10 matrix holding the integers 1 to 100 in random order)


One could simply do: 


apply(m, 1, var) 
[1] 871.6556 957.5111 699.2111 941.4333 1237.3333 641.8222 539.7889 759.4333 500.4889 
1255.6111 


On the other hand, one could also completely vectorize this operation by following the formula of variance 


RowVar <- function(x) {
  rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)
}
RowVar(m)


[1] 871.6556 957.5111 699.2111 941.4333 1237.3333 641.8222 539.7889 759.4333 500.4889
1255.6111
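

To see that the vectorized version really computes the same thing, and to get a feel for the cost of the by row loop, one can compare both on a larger matrix. This is a sketch under assumptions: the matrix big, its size and the seed are arbitrary, and no timings are quoted because they depend on the machine.

```r
RowVar <- function(x) {
  rowSums((x - rowMeans(x))^2) / (dim(x)[2] - 1)
}

set.seed(1)                                  # arbitrary seed, for repeatability only
big <- matrix(rnorm(1e4 * 50), nrow = 1e4)   # hypothetical test matrix

viaApply  <- apply(big, 1, var)   # evaluates var() once per row
viaRowVar <- RowVar(big)          # a handful of vectorized calls in total
sameResult <- isTRUE(all.equal(viaApply, viaRowVar))

system.time(apply(big, 1, var))
system.time(RowVar(big))
```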
Chapter 66: Missing values


When we don't know the value a variable takes, we say its value is missing, indicated by NA. 


Section 66.1: Examining missing data 

anyNA reports whether any missing values are present, while is.na reports missing values elementwise:
vec <- c(1, 2, 3, NA, 5) 

anyNA(vec) 

# [1] TRUE 


is.na(vec) 
# [1] FALSE FALSE FALSE TRUE FALSE 


is.na returns a logical vector that is coerced to integer values under arithmetic operations (with FALSE=0, TRUE=1). 
We can use this to find out how many missing values there are: 


sum(is.na(vec))
# [1] 1


Extending this approach, we can use colSums and is.na on a data frame to count NAs per column:


colSums(is.na(airquality))
#   Ozone Solar.R    Wind    Temp   Month     Day
#      37       7       0       0       0       0
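

The same idea extends to proportions and to whole rows. The following sketch uses colMeans() and complete.cases(), which are not mentioned above; the variable names are made up.

```r
naShare <- colMeans(is.na(airquality))   # proportion of NA values per column
round(naShare, 3)

completeRows <- sum(complete.cases(airquality))   # rows without any NA
completeRows
```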


The naniar package (available on CRAN) offers further tools for exploring missing values.


Section 66.2: Reading and writing data with NA values 


When reading tabular datasets with the read.* functions, R automatically treats the string "NA" as a missing value.
However, missing values are not always represented by NA. Sometimes a dot (.), a hyphen (-) or a character value
(e.g. empty) indicates that a value is NA. The na.strings parameter of the read.* functions can be used to tell
R which symbols/characters need to be treated as NA values:


read.csv("name_of_csv_file.csv", na.strings = "-") 
It is also possible to indicate that more than one symbol needs to be read as NA: 
read.csv('missing.csv', na.strings = c('.','-')) 


Similarly, NAs can be written with customized strings using the na argument to write.csv. Other tools for reading
and writing tables have similar options.
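

A round trip through a temporary file shows both arguments working together. This is a minimal sketch; the data frame scores, the hyphen as the NA marker and the temporary path are all assumptions made for the illustration.

```r
scores <- data.frame(id = 1:3, score = c(10, NA, 30))
path <- tempfile(fileext = ".csv")

# write the missing value as a hyphen instead of the default "NA"
write.csv(scores, path, row.names = FALSE, na = "-")
readLines(path)

# tell read.csv that a hyphen means "missing" when reading the file back
back <- read.csv(path, na.strings = "-")
is.na(back$score)
```

Without na.strings = "-" the score column would be read as character, because the hyphen cannot be parsed as a number.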


Section 66.3: Using NAs of different classes 


The symbol NA is for a logical missing value: 


class(NA) 
#[1] "logical" 


This is convenient, since it can easily be coerced to other atomic vector types, and is therefore usually the only NA
you will need:


x <- c(1, NA, 1) 
class(x[2]) 
#[1] "numeric" 


If you do need a single NA value of another type, use NA_character_, NA_integer_, NA_real_ or NA_complex_. For 
missing values of fancy classes, subsetting with NA_integer_ usually works; for example, to get a missing-value 
Date: 


class(Sys.Date()[NA_integer_]) 
# [1] "Date"
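

The types of the different NA constants can be inspected with typeof(). The sketch below also uses as.Date(NA) as another way to obtain a missing Date; that alternative is an addition and not part of the original text.

```r
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# a missing Date, obtained by conversion instead of subsetting
missingDate <- as.Date(NA)
class(missingDate)     # "Date"
```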


Section 66.4: TRUE/FALSE and/or NA 


NA is a logical type, and a logical operator with an NA will return NA if the outcome is ambiguous. Below, NA OR TRUE
evaluates to TRUE because at least one side evaluates to TRUE; however, NA OR FALSE returns NA because we do not
know whether NA would have been TRUE or FALSE.


NA | TRUE 
# [1] TRUE 
# TRUE | TRUE is TRUE and FALSE | TRUE is also TRUE. 


NA | FALSE 
# [1] NA 
# TRUE | FALSE is TRUE but FALSE | FALSE is FALSE. 


NA & TRUE 
# [1] NA 
# TRUE & TRUE is TRUE but FALSE & TRUE is FALSE. 


NA & FALSE 
# [1] FALSE 
# TRUE & FALSE is FALSE and FALSE & FALSE is also FALSE. 


These properties are helpful if you want to subset a data set based on some columns that contain NA. 


df <- data.frame(v1=0:9,
                 v2=c(rep(1:2, each=4), NA, NA),
                 v3=c(NA, letters[2:10]))


df[df$v2 == 1 & !is.na(df$v2), ]
#   v1 v2   v3
# 1  0  1 <NA>
# 2  1  1    b
# 3  2  1    c
# 4  3  1    d


df[df$v2 == 1, ]
#      v1 v2   v3
# 1     0  1 <NA>
# 2     1  1    b
# 3     2  1    c
# 4     3  1    d
# NA   NA NA <NA>
# NA.1 NA NA <NA>
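

There are other ways to keep the NA rows out of such a subset. The following sketch rebuilds the same data frame and shows three of them; which(), %in% and subset() are additions to the original example.

```r
df <- data.frame(v1 = 0:9,
                 v2 = c(rep(1:2, each = 4), NA, NA),
                 v3 = c(NA, letters[2:10]))

viaWhich  <- df[which(df$v2 == 1), ]   # which() keeps only the TRUE positions
viaIn     <- df[df$v2 %in% 1, ]        # %in% never returns NA
viaSubset <- subset(df, v2 == 1)       # subset() treats NA as FALSE

nrow(viaWhich)   # 4 in each case
```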
Chapter 67: Hierarchical Linear Modeling


Section 67.1: basic model fitting 




The primary packages for fitting hierarchical (alternatively "mixed" or "multilevel") linear models in R are nlme
(older) and lme4 (newer). These packages differ in many minor ways but should generally result in very similar fitted
models.


library(nlme)
library(lme4)

m1.nlme <- lme(Reaction~Days, random=~Days|Subject, data=sleepstudy, method="REML")
m1.lme4 <- lmer(Reaction~Days+(Days|Subject), data=sleepstudy, REML=TRUE)
all.equal(fixef(m1.nlme), fixef(m1.lme4))

## [1] TRUE


Differences to consider: 


• formula syntax is slightly different
• nlme is (still) somewhat better documented (e.g. Pinheiro and Bates 2000 Mixed-effects models in S-PLUS;
  however, see Bates et al. 2015 Journal of Statistical Software/vignette("lmer", package="lme4") for lme4)
• lme4 is faster and allows easier fitting of crossed random effects
• nlme provides p-values for linear mixed models out of the box, lme4 requires add-on packages such as
  lmerTest or afex
• nlme allows modeling of heteroscedasticity or residual correlations (in space/time/phylogeny)


The unofficial GLMM FAQ provides more information, although it is focused on generalized linear mixed models 
(GLMMs). 
Chapter 68: *apply family of functions
(functionals) 


Section 68.1: Using built-in functionals 


Built-in functionals: lapply(), sapply(), and mapply() 


R comes with built-in functionals, of which perhaps the most well-known are the apply family of functions. Here is a
description of some of the most common apply functions:


• lapply() = takes a list as an argument and applies the specified function to each element of the list.
• sapply() = the same as lapply() but attempts to simplify the output to a vector or a matrix.
    ◦ vapply() = a variant of sapply() in which the output object's type must be specified.
• mapply() = like lapply() but can pass multiple vectors as input to the specified function. Can be simplified
  like sapply().
    ◦ Map() is an alias to mapply() with SIMPLIFY = FALSE.
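

The difference between sapply() and vapply() matters most at the edges. The following sketch uses a made-up list x; the tryCatch() wrapper is only there so that the deliberate error does not stop the script.

```r
x <- list(a = 1:3, b = 4:6)

viaSapply <- sapply(x, mean)                          # named numeric vector
viaVapply <- vapply(x, mean, FUN.VALUE = numeric(1))  # same result, type checked

# with empty input sapply() falls back to a list, vapply() keeps its promise
emptySapply <- sapply(list(), mean)                          # list()
emptyVapply <- vapply(list(), mean, FUN.VALUE = numeric(1))  # numeric(0)

# a type mismatch is an error in vapply() instead of a silent surprise
mismatch <- tryCatch(vapply(x, function(v) "text", FUN.VALUE = numeric(1)),
                     error = function(e) "error")
```

For code that runs unattended, the stricter vapply() is usually the safer of the two.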


lapply() 
lapply() can be used with two different iterations:


• lapply(variable, FUN)
• lapply(seq_along(variable), FUN)

# Two ways of finding the mean of each column of df
set.seed(1)
df <- data.frame(x = rnorm(25), y = rnorm(25))


lapply(df, mean)
lapply(seq_along(df), function(x) mean(df[[x]]))


sapply() 


sapply() will attempt to resolve its output to either a vector or a matrix. 


# Two examples to show the different outputs of sapply() 
sapply(letters, print) ## produces a vector 

x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE, FALSE, FALSE, TRUE) ) 
sapply(x, quantile) ## produces a matrix 


mapply() 
mapply() works much like lapply() except it can take multiple vectors as input (hence the m for multivariate). 


mapply(sum, 1:5, 10:6, 3) # 3 will be "recycled" by mapply 
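

A short sketch of how mapply() and Map() differ in what they return, and of the MoreArgs argument for values that should be the same in every call. The anonymous functions and variable names are made up for the illustration.

```r
sums  <- mapply(function(x, y) x + y, 1:3, 4:6)   # simplified to a vector: 5 7 9
pairs <- Map(function(x, y) x + y, 1:3, 4:6)      # always a list

# MoreArgs passes a value that is identical for every call, without recycling
scaled <- mapply(function(x, y, k) (x + y) * k, 1:3, 4:6, MoreArgs = list(k = 10))
scaled   # 50 70 90
```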


Section 68.2: Combining multiple data.frames (lapply,
mapply)


In this exercise, we will generate four bootstrap linear regression models and combine the summaries of these 
models into a single data frame. 
library(broom)


#* Create the bootstrap data sets
BootData <- lapply(1:4,
                   function(i) mtcars[sample(1:nrow(mtcars),
                                             size = nrow(mtcars),
                                             replace = TRUE), ])


#* Fit the models
Models <- lapply(BootData,
                 function(BD) lm(mpg ~ qsec + wt + factor(am),
                                 data = BD))


#* Tidy the output into a data.frame
Tidied <- lapply(Models,
                 tidy)


#* Give each element in the Tidied list a name
Tidied <- setNames(Tidied, paste0("Boot", seq_along(Tidied)))


At this point, we can take two approaches to inserting the names into the data.frame. 


#* Insert the element name into the summary with `lapply`
#* Requires passing the names attribute to `lapply` and referencing `Tidied` within
#* the applied function.
Described_lapply <-
  lapply(names(Tidied),
         function(nm) cbind(nm, Tidied[[nm]]))


Combined_lapply <- do.call("rbind", Described_lapply)


#* Insert the element name into the summary with `mapply`
#* Allows us to pass the names and the elements as separate arguments.
Described_mapply <- 
mapply( 
function(nm, dframe) cbind(nm, dframe), 
names(Tidied), 
Tidied, 
SIMPLIFY = FALSE) 


Combined_mapply <- do.call("rbind", Described_mapply) 
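

If broom is not available, a similar table can be assembled from coef(summary(model)). The following is only a sketch of that alternative, not the method used above: the model is simplified to mpg ~ qsec + wt, the seed is arbitrary, and the names bootFits, coefTables and combinedBase are made up.

```r
set.seed(42)   # arbitrary seed so the resamples are repeatable

# fit one model per bootstrap resample of the rows of mtcars
bootFits <- lapply(1:4, function(i) {
  rows <- sample(nrow(mtcars), replace = TRUE)
  lm(mpg ~ qsec + wt, data = mtcars[rows, ])
})
names(bootFits) <- paste0("Boot", seq_along(bootFits))

# turn each coefficient matrix into a data.frame that carries the model name
coefTables <- Map(function(nm, fit) {
  tab <- as.data.frame(coef(summary(fit)))
  data.frame(nm = nm, term = rownames(tab), tab, check.names = FALSE)
}, names(bootFits), bootFits)

combinedBase <- do.call(rbind, coefTables)
rownames(combinedBase) <- NULL
head(combinedBase, 3)
```

Map() plays the role of mapply(..., SIMPLIFY = FALSE) here, handing the name and the model to the function as separate arguments.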


If you're a fan of magrittr style pipes, you can accomplish the entire task in a single chain (though it may not be 
prudent to do so if you need any of the intermediary objects, such as the model objects themselves): 


library(magrittr) 
library(broom) 
Combined <- lapply(1:4, 
function(i) mtcars[sample(1:nrow(mtcars) , 
size = nrow(mtcars), 
replace = TRUE), ]) %>% 
  lapply(function(BD) lm(mpg ~ qsec + wt + factor(am), data = BD)) %>%
  lapply(tidy) %>%
  setNames(paste0("Boot", seq_along(.))) %>%
mapply(function(nm, dframe) cbind(nm, dframe), 
nm = names(.), 
dframe = ., 
SIMPLIFY = FALSE) %>% 
do.call("rbind", .) 
Section 68.3: Bulk File Loading


This approach suits a large number of files that need to be processed in the same way and that have well-structured
file names.


First, a vector of the file names to be accessed must be created. There are multiple options for this:


• Creating the vector manually with paste0()


files <- paste0("file_", 1:100, ".rds")


• Using list.files() with a regex search term for the file type. This requires knowledge of regular expressions
  (regex) if other files of the same type are in the directory.


files <- list.files("./", pattern = "\\.rds$", full.names = TRUE)


where X is a vector of part of the file naming format used.
lapply will output each response as an element of a list.
readRDS is specific to .rds files and will change depending on the application of the process.


my_file_list <- lapply(files, readRDS) 


In testing, this is not necessarily faster than a for loop, but it lets every file become an element of a list without
assigning each one explicitly.
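

The whole pattern can be tried out without any existing files by writing a few to a temporary directory first. This is a sketch under assumptions: the directory name bulk_demo, the file names and the toy data frames are made up for the illustration.

```r
# write three small objects to a temporary directory (hypothetical file names)
demoDir <- file.path(tempdir(), "bulk_demo")
dir.create(demoDir, showWarnings = FALSE)
for (i in 1:3) {
  saveRDS(data.frame(id = i, value = i^2),
          file.path(demoDir, paste0("file_", i, ".rds")))
}

files <- list.files(demoDir, pattern = "\\.rds$", full.names = TRUE)
my_file_list <- lapply(files, readRDS)

# name each element after its file so it can be looked up later
names(my_file_list) <- basename(files)

# if all files share the same columns they can be stacked into one data.frame
combined <- do.call(rbind, my_file_list)
```

Naming the list elements after their files keeps the link between each object and its source, which is otherwise lost once the list is built.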


Finally, we often need to load multiple packages at once. This trick can do it quite easily by applying library() to all 
libraries that we wish to import: 


lapply(c("jsonlite","stringr","igraph"), library, character.only=TRUE)


Section 68.4: Using user-defined functionals 
User-defined functionals 


Users can create their own functionals to varying degrees of complexity. The following examples are from 
Functionals by Hadley Wickham: 


randomise <- function(f) f(runif(1e3))

lapply2 <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}


In the first case, randomise accepts a single argument f, and calls it on a sample of Uniform random variables. To 
demonstrate equivalence, we call set.seed below: 


set.seed(123)
randomise(mean)
#[1] 0.4972778


set.seed(123)
mean(runif(1e3))
#[1] 0.4972778


set.seed(123)
randomise(max)
#[1] 0.9994045


set.seed(123)
max(runif(1e3))
#[1] 0.9994045


The second example is a re-implementation of base::lapply, which uses functionals to apply an operation (f) to
each element in a list (x). The ... parameter allows the user to pass additional arguments to f, such as the na.rm
option in the mean function:


lapply(list(c(1, 3, 5), c(2, NA, 6)), mean)
[[1]]
[1] 3


[[2]]
[1] NA


# ----------


lapply2(list(c(1, 3, 5), c(2, NA, 6)), mean)
[[1]]
[1] 3


[[2]]
[1] NA


# ----------


lapply(list(c(1, 3, 5), c(2, NA, 6)), mean, na.rm = TRUE)
[[1]]
[1] 3


[[2]]
[1] 4


# ----------


lapply2(list(c(1, 3, 5), c(2, NA, 6)), mean, na.rm = TRUE)
[[1]]
[1] 3


[[2]]
[1] 4
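

The equivalence can also be checked programmatically. The sketch below repeats the definition of lapply2 so that it runs on its own; the named-list comparison at the end is an addition that points out one difference from base::lapply.

```r
lapply2 <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

input <- list(c(1, 3, 5), c(2, NA, 6))
sameAsBase <- identical(lapply2(input, mean, na.rm = TRUE),
                        lapply(input, mean, na.rm = TRUE))

# one difference: base lapply() keeps the names of its input, lapply2() does not
named <- list(a = 1:3, b = 4:6)
names(lapply(named, sum))    # "a" "b"
names(lapply2(named, sum))   # NULL
```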
Chapter 69: Text mining
Section 69.1: Scraping Data to build N-gram Word Clouds 


The following example utilizes the tm text mining package to scrape and mine text data from the web to build word 
clouds with symbolic shading and ordering. 


require(RWeka) 

require(tau) 

require(tm) 
require(tm.plugin.webmining) 
require(wordcloud) 


# Scrape Google Finance ---------------------------------------------------
googlefinance <- WebCorpus(GoogleFinanceSource("NASDAQ:LFVN"))


# Scrape Google News ------------------------------------------------------
lv.googlenews <- WebCorpus(GoogleNewsSource("LifeVantage"))
p.googlenews <- WebCorpus(GoogleNewsSource("Protandim"))
ts.googlenews <- WebCorpus(GoogleNewsSource("TrueScience"))


# Scrape NYTimes ----------------------------------------------------------
lv.nytimes <- WebCorpus(NYTimesSource(query = "LifeVantage", appid = nytimes_appid))
p.nytimes <- WebCorpus(NYTimesSource("Protandim", appid = nytimes_appid))
ts.nytimes <- WebCorpus(NYTimesSource("TrueScience", appid = nytimes_appid))


# Scrape Reuters News -----------------------------------------------------
lv.reutersnews <- WebCorpus(ReutersNewsSource("LifeVantage"))
p.reutersnews <- WebCorpus(ReutersNewsSource("Protandim"))
ts.reutersnews <- WebCorpus(ReutersNewsSource("TrueScience"))


# Scrape Yahoo! Finance ---------------------------------------------------
lv.yahoofinance <- WebCorpus(YahooFinanceSource("LFVN"))


# Scrape Yahoo! News ------------------------------------------------------
lv.yahoonews <- WebCorpus(YahooNewsSource("LifeVantage"))
p.yahoonews <- WebCorpus(YahooNewsSource("Protandim"))
ts.yahoonews <- WebCorpus(YahooNewsSource("TrueScience"))


# Scrape Yahoo! Inplay ----------------------------------------------------
lv.yahooinplay <- WebCorpus(YahooInplaySource("LifeVantage"))


# Text Mining the Results -------------------------------------------------
corpus <- c(googlefinance, lv.googlenews, p.googlenews, ts.googlenews, lv.yahoofinance,
            lv.yahoonews, p.yahoonews, ts.yahoonews, lv.yahooinplay)
# not included: lv.nytimes, p.nytimes, ts.nytimes, lv.reutersnews, p.reutersnews, ts.reutersnews


inspect(corpus) 
wordlist <- c("lfvn", "lifevantage", "protandim", "truescience", "company", "fiscal", "nasdaq") 


ds0.1g <- tm_map(corpus, content_transformer(tolower))
ds1.1g <- tm_map(ds0.1g, content_transformer(removeWords), wordlist)
ds1.1g <- tm_map(ds1.1g, content_transformer(removeWords), stopwords("english"))
ds2.1g <- tm_map(ds1.1g, stripWhitespace)
ds3.1g <- tm_map(ds2.1g, removePunctuation)
ds4.1g <- tm_map(ds3.1g, stemDocument)


tdm.1g <- TermDocumentMatrix(ds4.1g) 
dtm.1g <- DocumentTermMatrix(ds4.1g) 
findFreqTerms(tdm.1g, 40)
findFreqTerms(tdm.1g, 60)
findFreqTerms(tdm.1g, 80)
findFreqTerms(tdm.1g, 100)


findAssocs(dtm.1g, "skin", .75)
findAssocs(dtm.1g, "scienc", .5)
findAssocs(dtm.1g, "product", .75)


tdm89.1g <- removeSparseTerms(tdm.1g, 0.89)
tdm9.1g <- removeSparseTerms(tdm.1g, 0.9)
tdm91.1g <- removeSparseTerms(tdm.1g, 0.91)
tdm92.1g <- removeSparseTerms(tdm.1g, 0.92)


tdm2.1g <- tdm92.1g 


# Creates a Boolean matrix (counts # docs w/terms, not raw # terms) 
tdm3.1g <- inspect(tdm2.1g) 
tdm3.1g[tdm3.1g>=1] <- 1 


# Transform into a term-term adjacency matrix 
termMatrix.1gram <- tdm3.1g %*% t(tdm3.1g) 


# inspect terms numbered 5 to 10 
termMatrix.1gram[5:10,5:10] 
termMatrix.1gram[1:10,1:10] 


# Create a WordCloud to Visualize the Text Data --------------------------- 
notsparse <- tdm2.1g 

m = as.matrix(notsparse) 

v = sort(rowSums(m) , decreasing=TRUE ) 

d = data.frame(word = names(v), freq=v) 


# Create the word cloud 

pal = brewer.pal(9, "BuPu")
wordcloud(words = d$word,
          freq = d$freq,
          scale = c(3,.8),
          random.order = F,
          colors = pal)


[Figure: 1-gram word cloud; legible terms include "growth" and "product"]


Note the use of random.order and a sequential palette from RColorBrewer, which allows the programmer to capture
more information in the cloud by assigning meaning to the order and coloring of terms.


Above is the 1-gram case. 


We can make a major leap to n-gram word clouds and in doing so we'll see how to make almost any text-mining 
analysis flexible enough to handle n-grams by transforming our TDM. 


The initial difficulty you run into with n-grams in R is that tm, the most popular package for text mining, does not 
inherently support tokenization of bi-grams or n-grams. Tokenization is the process of representing a word, part of 
a word, or group of words (or symbols) as a single data element called a token. 
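

To make the idea of an n-gram token concrete, here is a small base R sketch that splits a string into overlapping word groups. It is independent of tm, tau and RWeka; the function name make_ngrams and the sample sentence are made up, and the splitting rule (lower-case letters and digits only) is a simplifying assumption.

```r
# Hypothetical helper: split a string into overlapping n-word tokens
make_ngrams <- function(text, n = 2) {
  words <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  words <- words[nzchar(words)]             # drop empty pieces
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- make_ngrams("Skin care regimen for skin care")
bigrams          # "skin care" "care regimen" "regimen for" "for skin" "skin care"
table(bigrams)   # "skin care" is counted twice
```

Counting these tokens instead of single words is all that changes when a term-document matrix is built from n-grams.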


Fortunately, we have some hacks which allow us to continue using tm with an upgraded tokenizer. There’s more 
than one way to achieve this. We can write our own simple tokenizer using the textcnt() function from tau: 


tokenize_ngrams <- function(x, n=3) 
return(rownames(as.data.frame(unclass(textcnt(x,method="string",n=n))))) 


or we can invoke RWeka's tokenizer within tm: 


# BigramTokenize 
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2)) 


From this point you can proceed much as in the 1-gram case: 


# Create an n-gram Word Cloud ----------------------------------------------
tdm.ng <- TermDocumentMatrix(ds4.1g, control = list(tokenize = BigramTokenizer))
dtm.ng <- DocumentTermMatrix(ds4.1g, control = list(tokenize = BigramTokenizer))


# Try removing sparse terms at a few different levels 
tdm89.ng <- removeSparseTerms(tdm.ng, 0.89) 
tdm9.ng <- removeSparseTerms(tdm.ng, 0.9) 
tdm91.ng <- removeSparseTerms(tdm.ng, 0.91) 
tdm92.ng <- removeSparseTerms(tdm.ng, 0.92) 
notsparse <- tdm91.ng
m = as.matrix(notsparse)
v = sort(rowSums(m), decreasing=TRUE)
d = data.frame(word = names(v), freq=v)


# Create the word cloud 
pal = brewer.pal(9,"BuPu") 
wordcloud(words = d$word, 
freq = d$freq, 

scale = c(3,.8), 
random.order = F, 

colors = pal) 


[Figure: bigram word cloud; legible terms include "care regimen", "scientif valid", "end june", "salt lake",
"full year", "june 30", "oxid stress", "skin care" and "busi opportun"]


The example above is reproduced with permission from Hack-R's data science blog. Additional commentary may be 
found in the original article. 
Chapter 70: ANOVA


Section 70.1: Basic usage of aov() 


Analysis of Variance (aov) is used to determine if the means of two or more groups differ significantly from each 
other. Responses are assumed to be independent of each other, Normally distributed (within each group), and the 
within-group variances are assumed equal. 


In order to complete the analysis, data must be in long format (see the reshaping data topic). aov() is a wrapper around
the lm() function, using Wilkinson-Rogers formula notation y~f, where y is the response (dependent) variable and
f is a factor (categorical) variable representing group membership. If f is numeric rather than a factor variable, aov()
will report the results of a linear regression in ANOVA format, which may surprise inexperienced users.


The aov() function uses Type I (sequential) Sum of Squares. This type of Sum of Squares tests all of the (main and
interaction) effects sequentially. The result is that the first effect tested is also assigned shared variance between it
and other effects in the model. For the results from such a model to be reliable, data should be balanced (all groups
are of the same size).
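
The order dependence of sequential Sum of Squares can be checked directly. The sketch below is an illustration rather than part of the original text: it fits the same additive model on the unbalanced mtcars data twice, with the two factors in opposite order, and compares the Sum Sq column. The object names are made up for this example.

```r
# Same model, terms entered in a different order (mtcars is unbalanced in cyl and am).
fit_cyl_first <- aov(wt ~ factor(cyl) + factor(am), data = mtcars)
fit_am_first  <- aov(wt ~ factor(am) + factor(cyl), data = mtcars)

# Rows of each summary table: first term, second term, Residuals.
ss_cyl_first <- summary(fit_cyl_first)[[1]][["Sum Sq"]]
ss_am_first  <- summary(fit_am_first)[[1]][["Sum Sq"]]

# Sum of squares credited to factor(cyl) when it is entered first vs. second.
print(c(cyl_entered_first = ss_cyl_first[1], cyl_entered_second = ss_am_first[2]))

# The residual sum of squares does not depend on the order.
print(c(ss_cyl_first[3], ss_am_first[3]))
```

The term entered first absorbs the variance it shares with the other term, which is why the two values for factor(cyl) differ while the residual line stays the same.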


When the assumptions for Type I Sum of Squares do not hold, Type II or Type III Sum of Squares may be applicable.
Type II Sum of Squares tests each main effect after every other main effect, and thus controls for any overlapping
variance. However, Type II Sum of Squares assumes no interaction between the main effects.


Lastly, Type III Sum of Squares tests each main effect after every other main effect and every interaction. This
makes Type III Sum of Squares a necessity when an interaction is present.


Type II and Type III Sums of Squares are implemented in the Anova() function from the car package.


Using the mtcars data set as an example:

mtCarsAnovaModel <- aov(wt ~ factor(cyl), data=mtcars)

To view a summary of the ANOVA model:

summary(mtCarsAnovaModel)

One can also extract the coefficients of the underlying lm() model:


coefficients(mtCarsAnovaModel) 
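
The warning above about numeric predictors can be seen by comparing degrees of freedom. This is a small sketch added for illustration (the object names are assumptions): with factor(cyl) the three cylinder groups use two degrees of freedom, whereas a numeric cyl is treated as a single regression slope.

```r
fit_factor  <- aov(wt ~ factor(cyl), data = mtcars)  # one-way ANOVA on three groups
fit_numeric <- aov(wt ~ cyl, data = mtcars)          # linear regression reported in ANOVA format

# Degrees of freedom: model term first, residuals second.
df_factor  <- summary(fit_factor)[[1]][["Df"]]
df_numeric <- summary(fit_numeric)[[1]][["Df"]]

print(df_factor)
print(df_numeric)
```

If the model term shows a single degree of freedom for a grouping variable with more than two levels, the variable was most likely not converted to a factor.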


Section 70.2: Basic usage of Anova() 


When dealing with an unbalanced design and/or non-orthogonal contrasts, Type II or Type III Sum of Squares are 
necessary. The Anova() function from the car package implements these. Type II Sum of Squares assumes no 
interaction between main effects. If interactions are assumed, Type III Sum of Squares is appropriate. 


The Anova() function takes a fitted model object, such as one returned by lm().


Using the mtcars data set as an example, demonstrating the difference between Type II and Type III when an
interaction is tested.


> Anova(lm(wt ~ factor(cyl)*factor(am), data=mtcars), type = 2)
Anova Table (Type II tests)

Response: wt
                       Sum Sq Df F value    Pr(>F)
factor(cyl)            7.2278  2 11.5266 0.0002606 ***
factor(am)             3.2845  1 10.4758 0.0032895 **
factor(cyl):factor(am) 0.0668  2  0.1065 0.8993714
Residuals              8.1517 26

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


> Anova(lm(wt ~ factor(cyl)*factor(am), data=mtcars), type = 3)
Anova Table (Type III tests)

Response: wt
                        Sum Sq Df F value    Pr(>F)
(Intercept)            25.8427  1 82.4254 1.524e-09 ***
factor(cyl)             4.0124  2  6.3988  0.005498 **
factor(am)              1.7389  1  5.5463  0.026346 *
factor(cyl):factor(am)  0.0668  2  0.1065  0.899371
Residuals               8.1517 26

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
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Chapter 71: Raster and Image Analysis 


See also I/O for Raster Images 


Section 71.1: Calculating GLCM Texture 


Gray Level Co-Occurrence Matrix (Haralick et al. 1973) texture is a powerful image feature for image analysis. The
glcm package provides an easy-to-use function to calculate such textural features for RasterLayer objects in R.
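
To show what a co-occurrence matrix is, the following base-R sketch counts how often pairs of gray levels occur next to each other in a tiny 3x3 image and derives the "contrast" statistic from it. It is a teaching illustration under stated assumptions, not the glcm package's implementation: the function name glcm_counts is hypothetical, the image is made up, there is no moving window, and the shift is interpreted here as (row offset, column offset), which may differ from the package's convention.

```r
# Toy image with gray levels 1..3.
img <- matrix(c(1, 1, 2,
                1, 2, 2,
                3, 3, 2), nrow = 3, byrow = TRUE)

# Count pairs (level at pixel, level at shifted pixel) and make the matrix symmetric.
glcm_counts <- function(img, shift = c(0, 1), levels = max(img)) {
    counts <- matrix(0, levels, levels)
    for (i in seq_len(nrow(img))) {
        for (j in seq_len(ncol(img))) {
            ni <- i + shift[1]
            nj <- j + shift[2]
            if (ni >= 1 && ni <= nrow(img) && nj >= 1 && nj <= ncol(img)) {
                a <- img[i, j]
                b <- img[ni, nj]
                counts[a, b] <- counts[a, b] + 1
            }
        }
    }
    counts + t(counts)
}

g <- glcm_counts(img)                         # horizontal neighbours
p <- g / sum(g)                               # co-occurrence probabilities
contrast <- sum(p * (row(p) - col(p))^2)      # weights pairs by squared level difference
print(g)
print(contrast)
```

Statistics such as homogeneity or entropy are computed from the same probability matrix p with different weighting functions.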


library(glcm) 
library(raster) 


r <- raster("C:/Program Files/R/R-3.2.3/doc/html/logo.jpg") 


plot(r) 
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Calculating GLCM textures in one direction 


rglcm <- glcm(r,
              window = c(9,9),
              shift = c(1,1),
              statistics = c("mean", "variance", "homogeneity", "contrast",
                             "dissimilarity", "entropy", "second_moment")
              )

plot(rglcm)
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Calculation rotation-invariant texture features 


The textural features can also be calculated in all 4 directions (0°, 45°, 90° and 135°) and then combined into one
rotation-invariant texture. The key for this is the shift parameter:


rglcm1 <- glcm(r,
               window = c(9,9),
               shift = list(c(0,1), c(1,1), c(1,0), c(1,-1)),
               statistics = c("mean", "variance", "homogeneity", "contrast",
                              "dissimilarity", "entropy", "second_moment")
               )

plot(rglcm1)
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Section 71.2: Mathematical Morphologies 


The package mmand provides functions for the calculation of Mathematical Morphologies for n-dimensional arrays. 
With a little workaround, these can also be calculated for raster images. 


library(raster) 
library(mmand)


r <- raster("C:/Program Files/R/R-3.2.3/doc/html/logo.jpg") 
plot(r) 


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At first, a kernel (moving window) has to be set with a size (e.g. 9x9) and a shape type (e.g. disc, box or diamond) 
sk <- shapeKernel(c(9,9), type="disc") 
Afterwards, the raster layer has to be converted into an array which is used as input for the erode() function.


rArr <- as.array(r, transpose = TRUE) 
rErode <- erode(rArr, sk) 
rErode <- setValues(r, as.vector(aperm(rErode) ) ) 


Besides erode(), the morphological functions dilate(), opening() and closing() can be applied in the same way.
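
For readers unfamiliar with erosion, here is a minimal base-R sketch of the operation on a small binary matrix: each output cell becomes the minimum of its neighbourhood. The function erode_box, the square (box) kernel and the truncated-window border handling are assumptions made for this illustration; mmand's erode() supports other kernel shapes and may treat borders differently.

```r
# Hypothetical helper: erosion with a k x k box kernel, window truncated at the borders.
erode_box <- function(m, k = 3) {
    h <- (k - 1) %/% 2
    out <- matrix(NA_real_, nrow(m), ncol(m))
    for (i in seq_len(nrow(m))) {
        for (j in seq_len(ncol(m))) {
            rows <- max(1, i - h):min(nrow(m), i + h)
            cols <- max(1, j - h):min(ncol(m), j + h)
            out[i, j] <- min(m[rows, cols])   # minimum over the neighbourhood
        }
    }
    out
}

m <- matrix(0, 5, 5)
m[2:4, 2:4] <- 1          # a 3x3 block of ones
eroded <- erode_box(m)
print(eroded)
```

Erosion shrinks bright regions: of the 3x3 block only the cell whose whole neighbourhood is bright survives. Replacing min() with max() would give the corresponding dilation.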


plot(rErode) 


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Chapter 72: Survival analysis 


Section 72.1: Random Forest Survival Analysis with 
randomForestSRC 


Just as the random forest algorithm may be applied to regression and classification tasks, it can also be extended to 
survival analysis. 


In the example below a survival model is fit and used for prediction, scoring, and performance analysis using the 
package randomForestSRC from CRAN. 


require(randomForestSRC) 


set.seed(130948) # Other seeds give similar comparative results
x1 <- runif(1000)
y <- rnorm(1000, mean = x1, sd = .3)
data <- data.frame(x1 = x1, y = y)
head(data)

         x1          y
1 0.9604353  1.3549648
2 0.3771234  0.2961592
3 0.7844242  0.6942191
4 0.9860443  1.5348900
5 0.1942237  0.4629535
6 0.7442532 -0.0672639


(modRFSRC <- rfsrc(y ~ x1, data = data, ntree=500, nodesize = 5)) 


Sample size: 1000 
Number of trees: 500 
Minimum terminal node size: 5 
Average no. of terminal nodes: 208.258 
No. of variables tried at each split: 1 
Total no. of variables: 1 
Analysis: RF-R 
Family: regr 
Splitting rule: mse 
% variance explained: 32.08 
Error rate: 0.11


x1new <- runif(10000)
ynew <- rnorm(10000, mean = x1new, sd = .3)
newdata <- data.frame(x1 = x1new, y = ynew)


survival.results <- predict(modRFSRC, newdata = newdata) 
survival.results 


Sample size of test (predict) data: 10000
Number of grow trees: 500
Average no. of grow terminal nodes: 208.258
Total no. of grow variables: 1
Analysis: RF-R
Family: regr
% variance explained: 34.97
Test set error rate: 0.11
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Section 72.2: Introduction - basic fitting and plotting of 
parametric survival models with the survival package 


survival is the most commonly used package for survival analysis in R. Using the built-in lung dataset we can get 
started with Survival Analysis by fitting a regression model with the survreg() function, creating a curve with 
survfit(), and plotting predicted survival curves by calling the predict method for this package with new data. 


In the example below we plot 2 predicted curves and vary sex between the 2 sets of new data, to visualize its effect: 


require(survival)
s <- with(lung, Surv(time, status))

sWei <- survreg(s ~ as.factor(sex)+age+ph.ecog+wt.loss+ph.karno, dist='weibull', data=lung)

fitKM <- survfit(s ~ sex, data=lung)

plot(fitKM)

lines(predict(sWei, newdata = list(sex      = 1,
                                   age      = 1,
                                   ph.ecog  = 1,
                                   ph.karno = 98,
                                   wt.loss  = 2),
              type = "quantile",
              p    = seq(.01, .99, by = .01)),
      seq(.99, .01, by = -.01),
      col = "blue")

lines(predict(sWei, newdata = list(sex      = 2,
                                   age      = 1,
                                   ph.ecog  = 1,
                                   ph.karno = 98,
                                   wt.loss  = 2),
              type = "quantile",
              p    = seq(.01, .99, by = .01)),
      seq(.99, .01, by = -.01),
      col = "red")
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Section 72.3: Kaplan Meier estimates of survival curves and 


risk set tables with survminer 
Base plot 


install.packages('survminer')
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.clinical") # data for examples
library(RTCGA.clinical)
survivalTCGA(BRCA.clinical, OV.clinical,
             extract.cols = "admin.disease_code") -> BRCAOV.survInfo
library(survival)
fit <- survfit(Surv(times, patient.vital_status) ~ admin.disease_code,
               data = BRCAOV.survInfo)
library(survminer)
ggsurvplot(fit, risk.table = TRUE)

[Figure: Kaplan-Meier survival curves for the strata admin.disease_code=brca and admin.disease_code=ov, with time
on the x axis and a risk table below.]
More advanced 
ggsurvplot(
   fit,                        # survfit object with calculated statistics.
   risk.table = TRUE,          # show risk table.
   pval = TRUE,                # show p-value of log-rank test.
   conf.int = TRUE,            # show confidence intervals for
                               # point estimates of survival curves.
   xlim = c(0,2000),           # present narrower X axis, but not affect
                               # survival estimates.
   break.time.by = 500,        # break X axis in time intervals by 500.
   ggtheme = theme_RTCGA(),    # customize plot and risk table with a theme.
   risk.table.y.text.col = T,  # colour risk table text annotations.
   risk.table.y.text = FALSE   # show bars instead of names in text annotations
                               # in legend of risk table
)

[Figure: survival curves for the strata admin.disease_code=brca and admin.disease_code=ov, y axis from 0.00 to 1.00,
time axis from 0 to 2000 in steps of 500, followed by a "Number at risk by time" table at the same time points.]

Based on http://r-addict.com/2016/05/23/Informative-Survival-Plots.html
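
To show what a Kaplan-Meier curve and its risk table contain, here is a small base-R sketch that computes the product-limit estimate by hand on made-up data. The function name km_table and the toy times are assumptions for illustration; in practice survfit() from the survival package does this work (and more, e.g. confidence intervals).

```r
# Made-up follow-up times; status 1 = event observed, 0 = censored.
time   <- c(5, 8, 8, 12, 15, 20)
status <- c(1, 1, 0, 1, 0, 1)

# Hypothetical helper: Kaplan-Meier estimate S(t) = prod(1 - d_i / n_i) over event times.
km_table <- function(time, status) {
    event_times <- sort(unique(time[status == 1]))
    n_risk  <- sapply(event_times, function(t) sum(time >= t))                # still under observation
    n_event <- sapply(event_times, function(t) sum(time == t & status == 1))  # events at time t
    data.frame(time    = event_times,
               n.risk  = n_risk,
               n.event = n_event,
               surv    = cumprod(1 - n_event / n_risk))
}

km <- km_table(time, status)
print(km)
```

The n.risk column is what a risk table reports, and the surv column gives the height of the step curve after each event time.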
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Chapter 73: Fault-tolerant/resilient code 


Parameter           Details
expr                In case the "try part" was completed successfully, tryCatch will return the last evaluated
                    expression. Hence, the actual value being returned in case everything went well and there is no
                    condition (i.e. a warning or an error) is the return value of readLines. Note that you don't need to
                    explicitly state the return value via return, as code in the "try part" is not wrapped inside a
                    function environment (unlike that for the condition handlers for warnings and errors below).
warning/error/etc   Provide/define a handler function for all the conditions that you want to handle explicitly. As far
                    as I understand, you can provide handlers for any type of condition (not just warnings and errors,
                    but also custom conditions; see simpleCondition and friends for that) as long as the name of the
                    respective handler function matches the class of the respective condition (see the Details
                    part of the doc for tryCatch).
finally             Here goes everything that should be executed at the very end, regardless of whether the expression
                    in the "try part" succeeded or there was any condition. If you want more than one expression to
                    be executed, then you need to wrap them in curly brackets; otherwise you could just have
                    written finally = <expression> (i.e. the same logic as for the "try part").


Section 73.1: Using tryCatch()


We're defining a robust version of a function that reads the HTML code from a given URL. Robust in the sense that
we want it to handle situations where something either goes wrong (error) or not quite the way we planned it to
(warning). The umbrella term for errors and warnings is condition.


Function definition using tryCatch 


readUrl <- function(url) {
    out <- tryCatch(

        ########################################################
        # Try part: define the expression(s) you want to "try" #
        ########################################################

        {
            # Just to highlight:
            # If you want to use more than one R expression in the "try part"
            # then you'll have to use curly brackets.
            # Otherwise, just write the single expression you want to try.

            message("This is the 'try' part")
            readLines(con = url, warn = FALSE)
        },

        ########################################################################
        # Condition handler part: define how you want conditions to be handled #
        ########################################################################

        # Handler when a warning occurs:
        warning = function(cond) {
            message(paste("Reading the URL caused a warning:", url))
            message("Here's the original warning message:")
            message(cond)

            # Choose a return value when such a type of condition occurs
            return(NULL)
        },

        # Handler when an error occurs:
        error = function(cond) {
            message(paste("This seems to be an invalid URL:", url))
            message("Here's the original error message:")
            message(cond)

            # Choose a return value when such a type of condition occurs
            return(NA)
        },

        ###############################################
        # Final part: define what should happen AFTER #
        # everything has been tried and/or handled    #
        ###############################################

        finally = {
            message(paste("Processed URL:", url))
            message("Some message at the end\n")
        }
    )
    return(out)
}

Testing things out 


Let's define a vector of URLs where one element isn't a valid URL 


urls <- c(
    "http://stat.ethz.ch/R-manual/R-devel/library/base/html/connections.html",
    "http://en.wikipedia.org/wiki/Xz",
    "I'm no URL"
)


And pass this as input to the function we defined above 


y <- lapply(urls, readUrl)
# Processed URL: http://stat.ethz.ch/R-manual/R-devel/library/base/html/connections.html
# Some message at the end
#
# Processed URL: http://en.wikipedia.org/wiki/Xz
# Some message at the end
#
# URL does not seem to exist: I'm no URL
# Here's the original error message:
# cannot open the connection
# Processed URL: I'm no URL
# Some message at the end
#
# Warning message:
# In file(con, "r") : cannot open file 'I'm no URL': No such file or directory


Investigating the output 


length(y)
# [1] 3

head(y[[1]])
# [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
# [2] "<html><head><title>R: Functions to Manipulate Connections</title>"
# [3] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
# [4] "<link rel=\"stylesheet\" type=\"text/css\" href=\"R.css\">"
# [5] "</head><body>"

y[[3]]
# [1] NA
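
The same pattern can be tried without network access. The sketch below is an added illustration (the function safe_log and its return values are assumptions): it maps a warning to NA, an error to NULL, and always runs the finally expression.

```r
# Hypothetical helper: log() that never stops the script.
safe_log <- function(x) {
    tryCatch(
        log(x),                                   # "try part": a single expression, no braces needed
        warning = function(cond) NA_real_,        # e.g. log(-1) warns that NaNs were produced
        error   = function(cond) NULL,            # e.g. log("a") is an error
        finally = message("safe_log() finished")  # runs in every case
    )
}

ok      <- safe_log(10)
warned  <- safe_log(-1)
errored <- safe_log("a")
```

Because each handler returns a distinct sentinel, the caller can tell the three outcomes apart with is.null() and is.na().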
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Chapter 74: Reproducible R 


By 'reproducibility' we mean that someone else (perhaps you in the future) can repeat the steps you performed
and get the same result. See the Reproducible Research Task View.


Section 74.1: Data reproducibility 
dput() and dget() 


The easiest way to share a (preferably small) data frame is to use the base function dput(). It will export an R object
in plain text form.


Note: Before making the example data below, make sure you're in an empty folder you can write to. Run getwd() and
read ?setwd if you need to change folders.


dput(mtcars, file = 'df.txt')

Then, anyone can load the precise R object into their global environment using the dget() function.

df <- dget('df.txt')


For larger R objects, there are a number of ways of saving them reproducibly. See Input and output.
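
A quick way to convince yourself that the round trip preserves the data is to compare the restored object with the original. This sketch writes to a temporary file (an assumption made here so that no file is left in the working directory) and uses all.equal() for the comparison.

```r
tmp <- tempfile(fileext = ".txt")   # temporary file instead of 'df.txt'
dput(mtcars, file = tmp)            # write the object as plain R code
df <- dget(tmp)                     # read it back
print(all.equal(df, mtcars))        # compare restored and original data
unlink(tmp)                         # clean up
```

The file produced by dput() is ordinary R source text, so it can also be pasted directly into a question or an e-mail.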


Section 74.2: Package reproducibility 


Package reproducibility is a very common issue when reproducing R code. When various packages get updated,
some interconnections between them may break. The ideal solution to the problem is to reproduce the image of
the R code writer's machine on your computer as of the date when the code was written. This is what the checkpoint
package is for.


Starting from 2014-09-17, the authors of the package make daily copies of the whole CRAN package repository to
their own mirror repository, the Microsoft R Application Network (MRAN). So, to avoid package reproducibility issues
when creating a reproducible R project, all you need to do is:


1. Make sure that all your packages (and R version) are up-to-date.
2. Include a checkpoint::checkpoint('YYYY-MM-DD') line in your code.


checkpoint will create a directory .checkpoint in your R_home directory ("~/"). Into this technical directory it will
install all the packages that are used in your project. That means checkpoint looks through all the .R files in your
project directory to pick up all the library() or require() calls and installs all the required packages in the form
they existed on CRAN on the specified date.


PRO You are freed from the package reproducibility issue. 
CONTRA For each specified date you have to download and install all the packages that are used in a certain 
project that you aim to reproduce. That may take quite a while. 
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Chapter 75: Fourier Series and 
Transformations 


The Fourier transform decomposes a function of time (a signal) into the frequencies that make it up, similarly to 
how a musical chord can be expressed as the amplitude (or loudness) of its constituent notes. The Fourier 
transform of a function of time itself is a complex-valued function of frequency, whose absolute value represents 
the amount of that frequency present in the original function, and whose complex argument is the phase offset of 
the basic sinusoid in that frequency. 


The Fourier transform is called the frequency domain representation of the original signal. The term Fourier 
transform refers to both the frequency domain representation and the mathematical operation that associates the 
frequency domain representation to a function of time. The Fourier transform is not limited to functions of time, 
but in order to have a unified language, the domain of the original function is commonly referred to as the time 
domain. For many functions of practical interest one can define an operation that reverses this: the inverse Fourier 
transformation, also called Fourier synthesis, of a frequency domain representation combines the contributions of 
all the different frequencies to recover the original function of time. 


Linear operations performed in one domain (time or frequency) have corresponding operations in the other 
domain, which are sometimes easier to perform. The operation of differentiation in the time domain corresponds 
to multiplication by the frequency, so some differential equations are easier to analyze in the frequency domain. 
Also, convolution in the time domain corresponds to ordinary multiplication in the frequency domain. Concretely, 
this means that any linear time-invariant system, such as an electronic filter applied to a signal, can be expressed 
relatively simply as an operation on frequencies. So significant simplification is often achieved by transforming time 
functions to the frequency domain, performing the desired operations, and transforming the result back to time. 


Harmonic analysis is the systematic study of the relationship between the frequency and time domains, including 
the kinds of functions or operations that are "simpler" in one or the other, and has deep connections to almost all 
areas of modern mathematics. 


Functions that are localized in the time domain have Fourier transforms that are spread out across the frequency 
domain and vice versa. The critical case is the Gaussian function, of substantial importance in probability theory 
and statistics as well as in the study of physical phenomena exhibiting normal distribution (e.g., diffusion), which 
with appropriate normalizations goes to itself under the Fourier transform. Joseph Fourier introduced the 
transform in his study of heat transfer, where Gaussian functions appear as solutions of the heat equation. 


The Fourier transform can be formally defined as an improper Riemann integral, making it an integral transform, 
although this definition is not suitable for many applications requiring a more sophisticated integration theory. 


For example, many relatively simple applications use the Dirac delta function, which can be treated formally as if it 
were a function, but the justification requires a mathematically more sophisticated viewpoint. The Fourier 
transform can also be generalized to functions of several variables on Euclidean space, sending a function of 3- 
dimensional space to a function of 3-dimensional momentum (or a function of space and time to a function of 4- 
momentum). 


This idea makes the spatial Fourier transform very natural in the study of waves, as well as in quantum mechanics, 
where it is important to be able to represent wave solutions as functions of either space or momentum and
sometimes both. In general, functions to which Fourier methods are applicable are complex-valued, and possibly 
vector-valued. Still further generalization is possible to functions on groups, which, besides the original Fourier 
transform on R or Rn (viewed as groups under addition), notably includes the discrete-time Fourier transform 
(DTFT, group = Z), the discrete Fourier transform (DFT, group = Z mod N) and the Fourier series or circular Fourier 
transform (group = S1, the unit circle = closed finite interval with endpoints identified). The latter is routinely
employed to handle periodic functions. The Fast Fourier transform (FFT) is an algorithm for computing the DFT.
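
As a concrete illustration of the DFT recovering "the amount of each frequency", the sketch below samples a signal made of two sine waves over exactly one period and inspects the output of base R's fft(). The signal, the sample count N and the 0.01 threshold are assumptions chosen for this example.

```r
N  <- 256
ts <- seq(0, 2 * pi, length.out = N + 1)[-(N + 1)]   # N samples over one period, endpoint excluded
signal <- 0.5 * sin(3 * ts) + 0.25 * sin(10 * ts)

spectrum  <- fft(signal)             # complex-valued DFT
amplitude <- 2 * Mod(spectrum) / N   # absolute value rescaled to sine amplitudes

# Harmonic numbers present in the signal (element k + 1 of the DFT holds harmonic k).
harmonics <- which(amplitude[1:(N / 2)] > 0.01) - 1
print(harmonics)
print(amplitude[harmonics + 1])
```

The absolute value of each DFT coefficient gives the amplitude of the corresponding harmonic, and Arg(spectrum) would give its phase offset.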


Section 75.1: Fourier Series 


Joseph Fourier showed that any periodic wave can be represented by a sum of simple sine waves. This sum is called
the Fourier Series. The Fourier Series only holds while the system is linear. If there is, e.g., some overflow effect (a
threshold where the output remains the same no matter how much input is given), a non-linear effect enters the
picture, breaking the sinusoidal wave and the superposition principle.


# Sine waves
xs <- seq(-2*pi, 2*pi, pi/100)
wave.1 <- sin(3*xs)
wave.2 <- sin(10*xs)
par(mfrow = c(1, 2))
plot(xs, wave.1, type="l", ylim=c(-1,1)); abline(h=0, lty=3)
plot(xs, wave.2, type="l", ylim=c(-1,1)); abline(h=0, lty=3)


# Complex Wave
wave.3 <- 0.5 * wave.1 + 0.25 * wave.2
plot(xs, wave.3, type="l"); title("Eg complex wave"); abline(h=0, lty=3)


wave.4 <- wave.3
wave.4[wave.3>0.5] <- 0.5
plot(xs, wave.4, type="l", ylim=c(-1.25,1.25))
title("overflowed, non-linear complex wave")
abline(h=0, lty=3)


Also, the Fourier Series only holds if the waves are periodic, i.e., they have a repeating pattern (non-periodic waves
are dealt with by the Fourier Transform, see below). A periodic wave has a frequency f and a wavelength λ (a wavelength
is the distance in the medium between the beginning and end of a cycle, λ = v/f0, where v is the wave velocity) that
are defined by the repeating pattern. A non-periodic wave does not have a frequency or wavelength.


Some concepts: 


• The fundamental period, T, is the period of all the samples taken, the time between the first sample and the
  last.
• The sampling rate, sr, is the number of samples taken over a time period (aka acquisition frequency). For
  simplicity we will make the time interval between samples equal. This time interval is called the sample
  interval, si, which is the fundamental period time divided by the number of samples N. So, si = T/N.
• The fundamental frequency, f0, which is 1/T. The fundamental frequency is the frequency of the repeating
  pattern or how long the wavelength is. In the previous waves, the fundamental frequency was 1/(2π). The
  frequencies of the wave components must be integer multiples of the fundamental frequency. f0 is called the
  first harmonic, the second harmonic is 2*f0, the third is 3*f0, etc.


repeat.xs <- seq(-2*pi, 0, pi/100)
wave.3.repeat <- 0.5*sin(3*repeat.xs) + 0.25*sin(10*repeat.xs)
plot(xs, wave.3, type="l")
title("Repeating pattern")
points(repeat.xs, wave.3.repeat, type="l", col="red");
abline(h=0, v=c(-2*pi, 0), lty=3)


Here's an R function for plotting trajectories given a Fourier series:


plot.fourier <- function(fourier.series, f.0, ts) {
  w <- 2*pi*f.0
  trajectory <- sapply(ts, function(t) fourier.series(t,w))
  plot(ts, trajectory, type="l", xlab="time", ylab="f(t)")
  abline(h=0, lty=3)
}
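
A function of this kind expects a fourier.series argument with the signature function(t, w). The sketch below builds one such series from the 3rd and 10th harmonics used earlier and checks numerically that the resulting trajectory repeats after one fundamental period 1/f.0. The series definition and the sample grid are assumptions for illustration, and the plotting call is left out so the example runs without a graphics device.

```r
# Hypothetical series: 3rd and 10th harmonic of the angular frequency w.
fourier.series <- function(t, w) 0.5 * sin(3 * w * t) + 0.25 * sin(10 * w * t)

f.0 <- 1 / (2 * pi)     # fundamental frequency, so the period is 2*pi
w   <- 2 * pi * f.0     # angular frequency
ts  <- seq(0, 2 * pi, length.out = 201)

trajectory <- sapply(ts, function(t) fourier.series(t, w))
shifted    <- sapply(ts + 1 / f.0, function(t) fourier.series(t, w))

# Largest difference between the wave and its copy one period later.
max_diff <- max(abs(trajectory - shifted))
print(max_diff)
```

Shifting by anything other than a whole number of periods would generally produce a visibly different trajectory.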
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Chapter 76: .Rprofile 


Section 76.1: .Rprofile - the first chunk of code executed 


.Rprofile is a file containing R code that is executed when you launch R from the directory containing the
.Rprofile file. The similarly named Rprofile.site, located in R's home directory, is executed by default every time
you load R from any directory. Rprofile.site and, to a greater extent, .Rprofile can be used to initialize an R
session with personal preferences and various utility functions that you have defined.


Important note: if you use RStudio, you can have a separate .Rprofile in every RStudio project directory. 


Here are some examples of code that you might include in an .Rprofile file. 


Setting your R home directory 


# set R_home
Sys.setenv(R_USER="c:/R_home") # just an example directory
# but don't confuse this with the $R_HOME environment variable.


Setting page size options


options(papersize="a4")
options(editor="notepad")
options(pager="internal")


Set the default help type


options(help_type="html")


Set a site library


.Library.site <- file.path(chartr("\\", "/", R.home()), "site-library")


Set a CRAN mirror


local({r <- getOption("repos")
       r["CRAN"] <- "http://my.local.cran"
       options(repos=r)})


Setting the location of your library
This will allow you to not have to install all the packages again with each R version update.


# library location
.libPaths("c:/R_home/Rpackages/win")


Custom shortcuts or functions 


Sometimes it is useful to have a shortcut for a long R expression. A common example of this is setting an active
binding to access the last top-level expression result without having to type out .Last.value:


makeActiveBinding(".", function(){.Last.value}, .GlobalEnv) 
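
An active binding re-runs its function every time the name is read. The sketch below demonstrates this with a counter in a separate environment, so that it does not touch .GlobalEnv; the names demo_env, counter and n_calls are made up for this illustration.

```r
demo_env <- new.env()

local({
    n_calls <- 0
    # Reading demo_env$counter calls this function and returns its value.
    makeActiveBinding("counter", function() {
        n_calls <<- n_calls + 1
        n_calls
    }, demo_env)
})

first  <- demo_env$counter   # first read
second <- demo_env$counter   # second read triggers the function again
print(c(first, second))
```

This is why the "." shortcut always reflects the current .Last.value rather than the value at the time the binding was created.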


Because .Rprofile is just an R file, it can contain any arbitrary R code. 
Pre-loading the most useful packages 


This is bad practice and should generally be avoided because it separates package loading code from the scripts 
where those packages are actually used. 


See Also


See help(Startup) for all the different startup scripts, and further aspects. In particular, two system-wide Profile
files can be loaded as well. The first, Rprofile, may contain global settings; the other file, Rprofile.site, may contain
local choices the system administrator can make for all users. Both files are found in the ${R_HOME}/etc directory of
the R installation. This directory also contains the global files Renviron and Renviron.site, which can both be
complemented with a local file ~/.Renviron in the user's home directory.


Section 76.2: .Rprofile example 


Startup 


# Load library setwidth on start - to set the width automatically.
.First <- function() {
  library(setwidth)
  # If 256 color terminal - use library colorout.
  if (Sys.getenv("TERM") %in% c("xterm-256color", "screen-256color")) {
    library("colorout")
  }
}


Options


# Select default CRAN mirror for package installation.
options(repos=c(CRAN="https://cran.gis-lab.info/"))

# Print maximum 1000 elements.
options(max.print=1000)

# No scientific notation.
options(scipen=10)

# No graphics in menus.
options(menu.graphics=FALSE)

# Auto-completion for package names.
utils::rc.settings(ipck=TRUE)


Custom Functions


# Invisible environment to mask defined functions
.env = new.env()

# Quit R without asking to save.
.env$q <- function (save="no", ...) {
  quit(save=save, ...)
}

# Attach the environment to enable functions.
attach(.env, warn.conflicts=FALSE)
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Chapter 77: dplyr 


Section 77.1: dplyr's single table verbs 


dplyr introduces a grammar of data manipulation in R. It provides a consistent interface to work with data no 
matter where it is stored: data.frame, data.table, or a database. The key pieces of dplyr are written using Rcpp, 
which makes it very fast for working with in-memory data. 


dplyr's philosophy is to have small functions that do one thing well. The five simple functions (filter, arrange,
select, mutate, and summarise) can be used to reveal new ways to describe data. When combined with group_by,
these functions can be used to calculate group-wise summary statistics.


Syntax commonalities 
All these functions have a similar syntax: 


• The first argument to all these functions is always a data frame.
• Columns can be referred to directly using bare variable names (i.e., without using $).
• These functions do not modify the original data itself, i.e., they don't have side effects. Hence, the results
  should always be saved to an object.


We will use the built-in mtcars dataset to explore dplyr's single table verbs. Before converting the type of mtcars to
tbl_df (since it makes printing cleaner), we add the rownames of the dataset as a column using the
rownames_to_column function from the tibble package.


library(dplyr) # This documentation was written using version 0.5.0
mtcars_tbl <- as_data_frame(tibble::rownames_to_column(mtcars, "cars"))


# examine the structure of data
head(mtcars_tbl)


# A tibble: 6 x 12
#               cars   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#              <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1         Mazda RX4  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
#2     Mazda RX4 Wag  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
#3        Datsun 710  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
#4    Hornet 4 Drive  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
#5 Hornet Sportabout  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2
#6           Valiant  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1

filter
filter 


filter helps subset rows that match certain criteria. The first argument is the name of the data.frame and the
second (and subsequent) arguments are the criteria that filter the data (these criteria should evaluate to either TRUE
or FALSE).


Subset all cars that have 4 cylinders - cyl:

filter(mtcars_tbl, cyl == 4)


# A tibble: 11 x 12
#         cars   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#        <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1  Datsun 710  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
#2   Merc 240D  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
#3    Merc 230  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
#4    Fiat 128  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
#5 Honda Civic  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
# ... with 6 more rows


We can pass multiple criteria separated by a comma. To subset the cars which have either 4 or 6 cylinders - cyl and 
have 5 gears - gear: 


filter(mtcars_tbl, cyl == 4 | cyl == 6, gear == 5) 


# A tibble: 3 x 12
#           cars   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#          <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Porsche 914-2  26.0     4 120.3    91  4.43 2.140  16.7     0     1     5     2
#2  Lotus Europa  30.4     4  95.1   113  3.77 1.513  16.9     1     1     5     2
#3  Ferrari Dino  19.7     6 145.0   175  3.62 2.770  15.5     0     1     5     6


filter selects rows based on criteria; to select rows by position, use slice. slice takes only 2 arguments: the first 
one is a data.frame and the second is a vector of integer row positions. 


To select rows 6 through 9: 


slice(mtcars_tbl, 6:9) 


# A tibble: 4 x 12
#               cars   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#              <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1           Valiant  18.1     6 225.0   105  2.76  3.46 20.22     1     0     3     1
#2        Duster 360  14.3     8 360.0   245  3.21  3.57 15.84     0     0     3     4
#3         Merc 240D  24.4     4 146.7    62  3.69  3.19 20.00     1     0     4     2
#4          Merc 230  22.8     4 140.8    95  3.92  3.15 22.90     1     0     4     2


Or: 
slice(mtcars_tbl, -c(1:5, 10:n())) 


This results in the same output as slice(mtcars_tbl, 6:9) 
n() represents the number of observations in the current group 
arrange 


arrange is used to sort the data by one or more specified variables. Just like the previous verb (and all other functions in 
dplyr), the first argument is a data.frame, and subsequent arguments are used to sort the data. If more than one 
variable is passed, the data is first sorted by the first variable, then by the second variable, and so on. 


To order the data by horsepower - hp 


arrange(mtcars_tbl, hp) 


# A tibble: 32 x 12
#               cars   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#              <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1       Honda Civic  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
#2         Merc 240D  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
#3    Toyota Corolla  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
#4          Fiat 128  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
#5         Fiat X1-9  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
#6     Porsche 914-2  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
# ... with 26 more rows


To arrange the data by miles per gallon - mpg in descending order, followed by number of cylinders - cyl: 


arrange(mtcars_tbl, desc(mpg), cyl) 


# A tibble: 32 x 12
#               cars   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#              <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1    Toyota Corolla  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
#2          Fiat 128  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
#3       Honda Civic  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
#4      Lotus Europa  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
#5         Fiat X1-9  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
#6     Porsche 914-2  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2
# ... with 26 more rows
select 


select is used to select only a subset of variables. To select only mpg, disp, wt, qsec, and vs from mtcars_tbl: 


select(mtcars_tbl, mpg, disp, wt, qsec, vs) 


# A tibble: 32 x 5
#    mpg  disp    wt  qsec    vs
#  <dbl> <dbl> <dbl> <dbl> <dbl>
#1  21.0 160.0 2.620 16.46     0
#2  21.0 160.0 2.875 17.02     0
#3  22.8 108.0 2.320 18.61     1
#4  21.4 258.0 3.215 19.44     1
#5  18.7 360.0 3.440 17.02     0
#6  18.1 225.0 3.460 20.22     1
# ... with 26 more rows


: notation can be used to select consecutive columns. To select columns from cars through disp and vs through 
carb: 


select(mtcars_tbl, cars:disp, vs:carb) 


# A tibble: 32 x 8
#               cars   mpg   cyl  disp    vs    am  gear  carb
#              <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1         Mazda RX4  21.0     6   160     0     1     4     4
#2     Mazda RX4 Wag  21.0     6   160     0     1     4     4
#3        Datsun 710  22.8     4   108     1     1     4     1
#4    Hornet 4 Drive  21.4     6   258     1     0     3     1
#5 Hornet Sportabout  18.7     8   360     0     0     3     2
#6           Valiant  18.1     6   225     1     0     3     1
# ... with 26 more rows


or select(mtcars_tbl, -(hp:qsec)) 


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For datasets that contain several columns, it can be tedious to select several columns by name. To make life easier, 


there are a number of helper functions (such as starts_with(), ends_with(), contains(), matches(), 
num_range(), one_of(), and everything()) that can be used in select. To learn more about how to use them, see 
?select_helpers and ?select. 
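

As a brief illustration of the helpers, the sketch below rebuilds the table so that it runs on its own (using tibble::as_tibble, an assumption about the installed tibble version) and picks columns by name pattern. The vector name wanted is purely illustrative.

```r
library(dplyr)

# Rebuild the example table so this snippet is self-contained
mtcars_tbl <- tibble::as_tibble(tibble::rownames_to_column(mtcars, "cars"))

# Columns whose names start with "d"
d_cols <- select(mtcars_tbl, starts_with("d"))
names(d_cols)

# Columns whose names contain "ar"
ar_cols <- select(mtcars_tbl, contains("ar"))
names(ar_cols)

# Column names held in a character vector (hypothetical variable name)
wanted <- c("mpg", "wt")
picked <- select(mtcars_tbl, one_of(wanted))
names(picked)
```

Note that the helpers take quoted strings, while columns named directly in select are written bare.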


Note: When referring to columns directly in select, we use bare column names, but quotes should be used when 
referring to columns in helper functions. 
To rename columns while selecting: 


select(mtcars_tbl, cylinders = cyl, displacement = disp) 


# A tibble: 32 x 2
#  cylinders displacement
#      <dbl>        <dbl>
#1         6        160.0
#2         6        160.0
#3         4        108.0
#4         6        258.0
#5         8        360.0
#6         6        225.0
# ... with 26 more rows


As expected, this drops all other variables. 
To rename columns without dropping other variables, use rename: 


rename(mtcars_tbl, cylinders = cyl, displacement = disp) 


# A tibble: 32 x 12
#               cars   mpg cylinders displacement    hp  drat    wt  qsec    vs
#              <chr> <dbl>     <dbl>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1         Mazda RX4  21.0         6        160.0   110  3.90 2.620 16.46     0
#2     Mazda RX4 Wag  21.0         6        160.0   110  3.90 2.875 17.02     0
#3        Datsun 710  22.8         4        108.0    93  3.85 2.320 18.61     1
#4    Hornet 4 Drive  21.4         6        258.0   110  3.08 3.215 19.44     1
#5 Hornet Sportabout  18.7         8        360.0   175  3.15 3.440 17.02     0
#6           Valiant  18.1         6        225.0   105  2.76 3.460 20.22     1
# ... with 26 more rows, and 3 more variables: am <dbl>, gear <dbl>, carb <dbl>

mutate 


mutate can be used to add new columns to the data. Like all other functions in dplyr, mutate does not modify the 
original data; it returns a new data.frame, and the result must be assigned to keep it. New columns are added at 
the end of the data.frame. 


mutate(mtcars_tbl, weight_ton = wt/2, weight_pounds = weight_ton * 2000) 


# A tibble: 32 x 14
#               cars   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb weight_ton weight_pounds
#              <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>         <dbl>
#1         Mazda RX4  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4     1.3100          2620
#2     Mazda RX4 Wag  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4     1.4375          2875
#3        Datsun 710  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1     1.1600          2320
#4    Hornet 4 Drive  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1     1.6075          3215
#5 Hornet Sportabout  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2     1.7200          3440
#6           Valiant  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1     1.7300          3460
# ... with 26 more rows
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Note the use of weight_ton while creating weight_pounds. Unlike base R, mutate allows us to refer to columns that 
we just created to be used for a subsequent operation. 
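

The difference from base R can be seen in a small sketch. It compares mutate with base R's transform, which evaluates all of its arguments against the original columns only; the column names half and double_half are made up for this illustration.

```r
library(dplyr)

# mutate() evaluates its arguments in order, so `half` is visible to `double_half`
m <- mutate(mtcars, half = wt / 2, double_half = half * 2)
all.equal(m$double_half, m$wt)

# transform() cannot see `half` while it is still being created in the same call,
# so (assuming no object called `half` exists in the workspace) it signals an error
res <- tryCatch(
  transform(mtcars, half = wt / 2, double_half = half * 2),
  error = function(e) conditionMessage(e)
)
res
```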


To retain only the newly created columns, use transmute instead of mutate: 


transmute(mtcars_tbl, weight_ton = wt/2, weight_pounds = weight_ton * 2000) 


# A tibble: 32 x 2
#  weight_ton weight_pounds
#       <dbl>         <dbl>
#1     1.3100          2620
#2     1.4375          2875
#3     1.1600          2320
#4     1.6075          3215
#5     1.7200          3440
#6     1.7300          3460
# ... with 26 more rows
summarise 


summarise calculates summary statistics of variables by collapsing multiple values to a single value. It can calculate 
multiple statistics and we can name these summary columns in the same statement. 


To calculate the mean and standard deviation of mpg and disp of all cars in the dataset: 


summarise(mtcars_tbl, mean_mpg = mean(mpg), sd_mpg = sd(mpg), 
mean_disp = mean(disp), sd_disp = sd(disp)) 


# A tibble: 1 x 4
#  mean_mpg   sd_mpg mean_disp  sd_disp
#     <dbl>    <dbl>     <dbl>    <dbl>
#1 20.09062 6.026948  230.7219 123.9387
group_by 


group_by can be used to perform group wise operations on data. When the verbs defined above are applied on this 
grouped data, they are automatically applied to each group separately. 


To find mean and sd of mpg by cyl: 


by_cyl <- group_by(mtcars_tbl, cyl) 
summarise(by_cyl, mean_mpg = mean(mpg), sd_mpg = sd(mpg)) 


# A tibble: 3 x 3
#    cyl mean_mpg   sd_mpg
#  <dbl>    <dbl>    <dbl>
#1     4 26.66364 4.509828
#2     6 19.74286 1.453567
#3     8 15.10000 2.560048


Putting it all together 


We select columns from cars through hp and gear, order the rows by cyl and from highest to lowest mpg, group the 
data by gear, and finally subset only those cars that have mpg > 20 and hp > 75: 


selected <- select(mtcars_tbl, cars:hp, gear) 
ordered <- arrange(selected, cyl, desc(mpg)) 
by_cyl <- group_by(ordered, gear) 
filter(by_cyl, mpg > 20, hp > 75) 


Source: local data frame [9 x 6]
Groups: gear [3]

#            cars   mpg   cyl  disp    hp  gear
#           <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1   Lotus Europa  30.4     4  95.1   113     5
#2  Porsche 914-2  26.0     4 120.3    91     5
#3     Datsun 710  22.8     4 108.0    93     4
#4       Merc 230  22.8     4 140.8    95     4
#5  Toyota Corona  21.5     4 120.1    97     3
# ... with 4 more rows


If we are not interested in the intermediate results, we can achieve the same result as above by nesting the 
function calls: 


filter( 
    group_by( 
        arrange( 
            select( 
                mtcars_tbl, cars:hp, gear 
            ), cyl, desc(mpg) 
        ), gear 
    ), mpg > 20, hp > 75 
) 


This can be a little difficult to read, so dplyr operations can instead be chained using the pipe operator %>%. The above 
code translates to: 


mtcars_tbl %>% 
    select(cars:hp, gear) %>% 
    arrange(cyl, desc(mpg)) %>% 
    group_by(gear) %>% 
    filter(mpg > 20, hp > 75) 
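

The pipe simply passes its left-hand side as the first argument of the call on its right, so x %>% f(y) is evaluated as f(x, y). A minimal check of that equivalence, using the plain mtcars data frame and hypothetical object names:

```r
library(dplyr)

# Nested form: read from the inside out
nested <- head(arrange(mtcars, desc(mpg)), 3)

# Piped form: read from top to bottom
piped <- mtcars %>%
  arrange(desc(mpg)) %>%
  head(3)

identical(nested, piped)
```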


summarise multiple columns 
dplyr provides summarise_all() to apply functions to all (non-grouping) columns. 
To find the number of distinct values for each column: 


mtcars_tbl %>% 
summarise_all(n_distinct) 


# A tibble: 1 x 12
#   cars   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1    32    25     3    27    22    22    29    30     2     2     3     6


To find the number of distinct values for each column by cyl: 


mtcars_tbl %>% 
group_by(cyl) %>% 
summarise_all(n_distinct) 


# A tibble: 3 x 12
#    cyl  cars   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#  <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1     4    11     9    11    10    10    11    11     2     2     3     2
#2     6     7     6     5     4     5     6     7     2     2     3     3
#3     8    14    12    11     9    11    13    14     1     2     2     4


Note that we just had to add the group_by statement and the rest of the code is the same. The output now consists 
of three rows - one for each unique value of cyl. 


To summarise specific multiple columns, use summarise_at 


mtcars_tbl %>% 
group_by(cyl) %>% 
summarise_at(c("mpg", "disp", "hp"), mean) 


# A tibble: 3 x 4
#    cyl      mpg     disp        hp
#  <dbl>    <dbl>    <dbl>     <dbl>
#1     4 26.66364 105.1364  82.63636
#2     6 19.74286 183.3143 122.28571
#3     8 15.10000 353.1000 209.21429


Helper functions (?select_helpers) can be used in place of column names to select specific columns. 

To apply multiple functions, either pass the function names as a character vector: 


mtcars_tbl %>% 
    group_by(cyl) %>% 
    summarise_at(c("mpg", "disp", "hp"), 
                 c("mean", "sd")) 


or wrap them inside funs: 


mtcars_tbl %>% 
group_by(cyl) %>% 
summarise_at(c("mpg", "disp", "hp"), 
funs(mean, sd)) 


# A tibble: 3 x 7
#    cyl mpg_mean disp_mean   hp_mean   mpg_sd  disp_sd    hp_sd
#  <dbl>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
#1     4 26.66364  105.1364  82.63636 4.509828 26.87159 20.93453
#2     6 19.74286  183.3143 122.28571 1.453567 41.56246 24.26049
#3     8 15.10000  353.1000 209.21429 2.560048 67.77132 50.97689


Column names are now suffixed with the function names to keep them distinct. To change the suffix, pass a 
name along with each function: 


mtcars_tbl %>% 
group_by(cyl) %>% 
summarise_at(c("mpg", "disp", "hp"), 
c(Mean = "mean", SD = "sd")) 


mtcars_tbl %>% 
group_by(cyl) %>% 
summarise_at(c("mpg", "disp", "hp"), 
funs(Mean = mean, SD = sd)) 


# A tibble: 3 x 7
#    cyl mpg_Mean disp_Mean   hp_Mean   mpg_SD  disp_SD    hp_SD
#  <dbl>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
#1     4 26.66364  105.1364  82.63636 4.509828 26.87159 20.93453
#2     6 19.74286  183.3143 122.28571 1.453567 41.56246 24.26049
#3     8 15.10000  353.1000 209.21429 2.560048 67.77132 50.97689


To select columns conditionally, use summarise_if: 


Take the mean of all columns that are numeric grouped by cyl: 


mtcars_tbl %>% 
group_by(cyl) %>% 
summarise_if(is.numeric, mean) 


# A tibble: 3 x 11
#    cyl      mpg     disp        hp     drat       wt     qsec
#  <dbl>    <dbl>    <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
#1     4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727
#2     6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714
#3     8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214
# ... with 4 more variables: vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>


However, some variables are discrete, and the mean of these variables doesn't make sense. 


To take the mean of only the continuous variables (here, numeric columns with more than 6 distinct values) by cyl: 


mtcars_tbl %>% 
group_by(cyl) %>% 
summarise_if(function(x) is.numeric(x) & n_distinct(x) > 6, mean) 


# A tibble: 3 x 7
#    cyl      mpg     disp        hp     drat       wt     qsec
#  <dbl>    <dbl>    <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
#1     4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727
#2     6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714
#3     8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214
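

The examples above were written for dplyr 0.5.0. In later releases (1.0.0 and newer, which is an assumption about the installed version) the scoped verbs summarise_at, summarise_if and summarise_all, as well as funs, have been superseded by across. A rough equivalent of the named-function example might look like this; the object names are illustrative only.

```r
library(dplyr)

# Mean and SD of selected columns per cylinder count, using across()
by_cyl_stats <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, disp, hp), list(Mean = mean, SD = sd)))

names(by_cyl_stats)

# Counterpart of summarise_if(is.numeric, mean)
by_cyl_means <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(where(is.numeric), mean))
```

With across the generated names are grouped by column first (mpg_Mean, mpg_SD, ...), which differs from the column order shown in the output above.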


Section 77.2: Aggregating with %>% (pipe) operator 


The pipe (%>%) operator can be used in combination with dplyr functions. In this example we use the mtcars 
dataset (see help("mtcars") for more information) to show how to summarize a data frame, and how to add variables to 
the data that hold the result of applying a function. 


library(dplyr) 
library(magrittr) 

df <- mtcars 
df$cars <- rownames(df) # just add the car names to the df 
df <- df[, c(ncol(df), 1:(ncol(df) - 1))] # and place the names in the first column 


1. Summarize the data 


To compute statistics we use summarize and the appropriate functions. In this case n() is used for counting the 
number of cases. 


df %>% 
summarize(count=n(),mean_mpg = mean(mpg, na.rm = TRUE), 
min_weight = min(wt),max_weight = max(wt) ) 


#  count mean_mpg min_weight max_weight
#1    32 20.09062      1.513      5.424


2. Compute statistics by group 


It is possible to compute the statistics by groups of the data. In this case, by number of cylinders and number of 
forward gears: 


df %>% 
group_by(cyl, gear) %>% 
summarize(count=n(),mean_mpg = mean(mpg, na.rm = TRUE), 
min_weight = min(wt),max_weight = max(wt) ) 


# Source: local data frame [8 x 6]
# Groups: cyl [?]
#
#     cyl  gear count mean_mpg min_weight max_weight
#   <dbl> <dbl> <int>    <dbl>      <dbl>      <dbl>
#1      4     3     1   21.500      2.465      2.465
#2      4     4     8   26.925      1.615      3.190
#3      4     5     2   28.200      1.513      2.140
#4      6     3     2   19.750      3.215      3.460
#5      6     4     4   19.750      2.620      3.440
#6      6     5     1   19.700      2.770      2.770
#7      8     3    12   15.050      3.435      5.424
#8      8     5     2   15.400      3.170      3.570


Section 77.3: Subset Observation (Rows) 


dplyr::filter() - Select a subset of rows in a data frame that meet a logical criterion: 


dplyr::filter(iris, Sepal.Length>7) 


#    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#  1          7.1         3.0          5.9         2.1 virginica
#  2          7.6         3.0          6.6         2.1 virginica
#  3          7.3         2.9          6.3         1.8 virginica
#  4          7.2         3.6          6.1         2.5 virginica
#  5          7.7         3.8          6.7         2.2 virginica
#  6          7.7         2.6          6.9         2.3 virginica
#  7          7.7         2.8          6.7         2.0 virginica
#  8          7.2         3.2          6.0         1.8 virginica
#  9          7.2         3.0          5.8         1.6 virginica
# 10          7.4         2.8          6.1         1.9 virginica
# 11          7.9         3.8          6.4         2.0 virginica
# 12          7.7         3.0          6.1         2.3 virginica


dplyr::distinct() - Remove duplicate rows: 
distinct(iris, Sepal.Length, .keep_all = TRUE) 


#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#  1          5.1         3.5          1.4         0.2     setosa
#  2          4.9         3.0          1.4         0.2     setosa
#  3          4.7         3.2          1.3         0.2     setosa
#  4          4.6         3.1          1.5         0.2     setosa
#  5          5.0         3.6          1.4         0.2     setosa
#  6          5.4         3.9          1.7         0.4     setosa
#  7          4.4         2.9          1.4         0.2     setosa
#  8          4.8         3.4          1.6         0.2     setosa
#  9          4.3         3.0          1.1         0.1     setosa
# 10          5.8         4.0          1.2         0.2     setosa
# 11          5.7         4.4          1.5         0.4     setosa
# 12          5.2         3.5          1.5         0.2     setosa
# 13          5.5         4.2          1.4         0.2     setosa
# 14          4.5         2.3          1.3         0.3     setosa
# 15          5.3         3.7          1.5         0.2     setosa
# 16          7.0         3.2          4.7         1.4 versicolor
# 17          6.4         3.2          4.5         1.5 versicolor
# 18          6.9         3.1          4.9         1.5 versicolor
# 19          6.5         2.8          4.6         1.5 versicolor
# 20          6.3         3.3          4.7         1.6 versicolor
# 21          6.6         2.9          4.6         1.3 versicolor
# 22          5.9         3.0          4.2         1.5 versicolor
# 23          6.0         2.2          4.0         1.0 versicolor
# 24          6.1         2.9          4.7         1.4 versicolor
# 25          5.6         2.9          3.6         1.3 versicolor
# 26          6.7         3.1          4.4         1.4 versicolor
# 27          6.2         2.2          4.5         1.5 versicolor
# 28          6.8         2.8          4.8         1.4 versicolor
# 29          7.1         3.0          5.9         2.1  virginica
# 30          7.6         3.0          6.6         2.1  virginica
# 31          7.3         2.9          6.3         1.8  virginica
# 32          7.2         3.6          6.1         2.5  virginica
# 33          7.7         3.8          6.7         2.2  virginica
# 34          7.4         2.8          6.1         1.9  virginica
# 35          7.9         3.8          6.4         2.0  virginica

Section 77.4: Examples of NSE and string variables in dplyr 


dplyr uses Non-Standard Evaluation (NSE), which is why we can normally use variable names without quotes. 
However, sometimes during the data pipeline we need to get our variable names from other sources, such as a 
Shiny selection box. For functions like select, we can just use select_ with string variables to select columns: 


variable1 <- "Sepal.Length" 
variable2 <- "Sepal.Width" 

iris %>% 
    select_(variable1, variable2) %>% 
    head(n=5) 

#   Sepal.Length Sepal.Width
# 1          5.1         3.5
# 2          4.9         3.0
# 3          4.7         3.2
# 4          4.6         3.1
# 5          5.0         3.6


But if we want to use other features such as summarize or filter, we need to use the interp function from the lazyeval 
package: 


variable1 <- "Sepal.Length" 
variable2 <- "Sepal.Width" 
variable3 <- "Species" 

iris %>% 
    select_(variable1, variable2, variable3) %>% 
    group_by_(variable3) %>% 
    summarize_(mean1 = lazyeval::interp(~mean(var), var = as.name(variable1)), 
               mean2 = lazyeval::interp(~mean(var), var = as.name(variable2))) 


#      Species mean1 mean2
#       <fctr> <dbl> <dbl>
# 1     setosa 5.006 3.428
# 2 versicolor 5.936 2.770
# 3  virginica 6.588 2.974
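

The underscore verbs (select_, group_by_, summarize_) were deprecated in later dplyr releases. A sketch of the same computation without them, assuming dplyr 1.0.0 or newer, uses all_of for selections and the .data pronoun for column lookups by string; the object name result is illustrative.

```r
library(dplyr)

variable1 <- "Sepal.Length"
variable2 <- "Sepal.Width"
variable3 <- "Species"

result <- iris %>%
  # all_of() selects the columns named in a character vector
  select(all_of(c(variable1, variable2, variable3))) %>%
  group_by(across(all_of(variable3))) %>%
  # .data[[name]] looks a column up by its string name
  summarise(mean1 = mean(.data[[variable1]]),
            mean2 = mean(.data[[variable2]]))

result
```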
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Chapter 78: caret 


caret is an R package that aids in the data processing needed for machine learning problems. It stands for 
classification and regression training. When building models for a real dataset, there are some tasks other than the 
actual learning algorithm that need to be performed, such as cleaning the data, dealing with incomplete 
observations, validating our model on a test set, and comparing different models. 


caret helps in these scenarios, independent of the actual learning algorithms used. 


Section 78.1: Preprocessing 


Pre-processing in caret is done through the preProcess() function. Given a matrix or data frame type object x, 
preProcess() applies transformations on the training data which can then be applied to testing data. 


The heart of the preProcess() function is the method argument. Method operations are applied in this order: 


1. Zero-variance filter 
2. Near-zero variance filter 
3. Box-Cox/Yeo-Johnson/exponential transformation 
4. Centering 
5. Scaling 
6. Range 
7. Imputation 
8. PCA 
9. ICA 
10. Spatial Sign 

Below, we take the mtcars data set and perform centering, scaling, and a spatial sign transform. 


library(caret) 

auto_index <- createDataPartition(mtcars$mpg, p = .8, 
                                  list = FALSE, 
                                  times = 1) 


mt_train <- mtcars[auto_index, ] 
mt_test <- mtcars[-auto_index, ] 


process_mtcars <- preProcess(mt_train, method = c("center", "scale", "spatialSign")) 


mtcars_train_transf <- predict(process_mtcars, mt_train) 
mtcars_test_tranf <- predict(process_mtcars,mt_test) 
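

The key idea, that statistics are estimated on the training data and then reused unchanged on the test data, can be sketched in base R for centering and scaling alone. This is not caret's implementation; the seed, split size and column choice are arbitrary assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the split reproducible

train_rows <- sample(nrow(mtcars), 25)
train <- mtcars[train_rows, c("mpg", "hp", "wt")]
test  <- mtcars[-train_rows, c("mpg", "hp", "wt")]

# Means and standard deviations come from the training rows only
centers <- colMeans(train)
scales  <- apply(train, 2, sd)

# The same statistics are applied to both sets
train_std <- scale(train, center = centers, scale = scales)
test_std  <- scale(test,  center = centers, scale = scales)

colMeans(train_std)  # zero up to rounding error, by construction
colMeans(test_std)   # generally not zero, since the test rows were not used for fitting
```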
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Chapter 79: Extracting and Listing Files in 
Compressed Archives 


Section 79.1: Extracting files from a .zip archive 


Unzipping a zip archive is done with the unzip function from the utils package (which is included in base R). 
unzip(zipfile = "bar.zip", exdir = "./foo") 


This will extract all files in "bar.zip" to the "foo" directory, which will be created if necessary. Tilde expansion is 
performed on the zipfile path, and relative paths are resolved from your working directory. Alternatively, you can 
pass the full path name as zipfile. 
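

To list the contents of an archive without extracting it, unzip can be called with list = TRUE. The sketch below reuses the file name "bar.zip" from the example above as a placeholder and only runs if such a file exists.

```r
zip_path <- "bar.zip"  # placeholder archive name taken from the example above

if (file.exists(zip_path)) {
  # list = TRUE returns a data frame (columns Name, Length, Date) instead of extracting
  contents <- unzip(zip_path, list = TRUE)
  print(contents$Name)

  # Extract only the first listed entry into ./foo
  unzip(zip_path, files = contents$Name[1], exdir = "./foo")
}
```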
Chapter 80: Probability Distributions with R 


Section 80.1: PDF and PMF for different distributions in R 
PMF FOR THE BINOMIAL DISTRIBUTION 

Suppose that a fair die is rolled 10 times. What is the probability of throwing exactly two sixes? 

You can answer the question using the dbinom function: 


> dbinom(2, 10, 1/6) 
[1] 0.29071 


PMF FOR THE POISSON DISTRIBUTION 


The number of sandwiches ordered in a restaurant on a given day is known to follow a Poisson distribution with a 
mean of 20. What is the probability that exactly eighteen sandwiches will be ordered tomorrow? 


You can answer the question with the dpois function: 


> dpois(18, 20) 
[1] 0.08439355 


PDF FOR THE NORMAL DISTRIBUTION 


To find the value of the pdf at x=2.5 for a normal distribution with a mean of 5 and a standard deviation of 2, use 
the command: 


> dnorm(2.5, mean=5, sd=2) 
[1] 0.09132454 
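

As a cross-check, the three values can be recomputed from the textbook formulas for each distribution. The variable names below are illustrative only.

```r
# Binomial PMF: choose(n, k) * p^k * (1 - p)^(n - k)
binom_manual <- choose(10, 2) * (1/6)^2 * (5/6)^8

# Poisson PMF: lambda^k * exp(-lambda) / k!
pois_manual <- 20^18 * exp(-20) / factorial(18)

# Normal PDF: exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
norm_manual <- exp(-(2.5 - 5)^2 / (2 * 2^2)) / (2 * sqrt(2 * pi))

all.equal(binom_manual, dbinom(2, 10, 1/6))
all.equal(pois_manual, dpois(18, 20))
all.equal(norm_manual, dnorm(2.5, mean = 5, sd = 2))
```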
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Chapter 81: R in LaTeX with knitr 


Option      Details 
echo        (TRUE/FALSE) - whether to include R source code in the output file 
message     (TRUE/FALSE) - whether to include messages from the R source execution in the output file 
warning     (TRUE/FALSE) - whether to include warnings from the R source execution in the output file 
error       (TRUE/FALSE) - whether to include errors from the R source execution in the output file 
cache       (TRUE/FALSE) - whether to cache the results of the R source execution 
fig.width   (numeric) - width of the plot generated by the R source execution 
fig.height  (numeric) - height of the plot generated by the R source execution 


Section 81.1: R in LaTeX with Knitr and Code Externalization 


Knitr is an R package that allows us to intermingle R code with LaTeX code. One way to achieve this is external code 
chunks. External code chunks allow us to develop/test R Scripts in an R development environment and then include 
the results in a report. It is a powerful organizational technique. This approach is demonstrated below. 


# r-noweb-file.Rnw 
\documentclass{article} 

<<echo=FALSE,cache=FALSE>>= 
knitr::opts_chunk$set(echo=FALSE, cache=TRUE) 
knitr::read_chunk('r-file.R') 
@ 

\begin{document} 
This is an Rnw file (R noweb). It contains a combination of LaTeX and R. 

Once we have called the read\_chunk command above we can reference sections of code in the r-file.R 
script. 

<<Chunk1>>= 
@ 
\end{document} 


When using this approach we keep our code in a separate R file as shown below. 


## r-file.R 
## note the specific comment style of a single pound sign followed by four dashes 

# ---- Chunk1 ---- 
print("This is R Code in an external file") 
x <- seq(1:10) 
y <- rev(seq(1:10)) 
plot(x,y) 
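

To build the PDF from these two files, something like the following could be run from an R session. It assumes both files sit in the current working directory and that a LaTeX distribution is installed; the call is skipped if the .Rnw file is not found.

```r
library(knitr)

rnw_file <- "r-noweb-file.Rnw"  # file name taken from the example above

if (file.exists(rnw_file)) {
  # knit2pdf() knits the .Rnw file to .tex and then compiles the .tex file with LaTeX
  knit2pdf(rnw_file)
}
```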


Section 81.2: R in LaTeX with Knitr and Inline Code Chunks 


Knitr is an R package that allows us to intermingle R code with LaTeX code. One way to achieve this is inline code 
chunks. This approach is demonstrated below. 


# r-noweb-file.Rnw 
\documentclass{article} 
\begin{document} 
This is an Rnw file (R noweb). It contains a combination of LaTeX and R. 

<<my-label>>= 
print("This is an R Code Chunk") 
x <- seq(1:10) 
@ 

Above is an internal code chunk. 
We can access data created in any code chunk inline with our LaTeX code like this. 
The length of array x is \Sexpr{length(x)}. 

\end{document} 


Section 81.3: R in LaTeX with Knitr and Internal Code Chunks 


Knitr is an R package that allows us to intermingle R code with LaTeX code. One way to achieve this is internal code 
chunks. This approach is demonstrated below. 


# r-noweb-file.Rnw 
\documentclass{article} 
\begin{document} 
This is an Rnw file (R noweb). It contains a combination of LaTeX and R. 

<<code-chunk-label>>= 
print("This is an R Code Chunk") 
x <- seq(1:10) 
y <- seq(1:10) 
plot(x,y) # simple scatter plot of y against x 
@ 

\end{document} 
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Chapter 82: Web Crawling in R 


Section 82.1: Standard scraping approach using the RCurl 
package 


We try to extract the IMDb top chart movies and their ratings: 


R> library(RCurl) 
R> library(XML) 
R> url <- "http://www.imdb.com/chart/top" 
R> top <- getURL(url) 
R> parsed_top <- htmlParse(top, encoding = "UTF-8") 
R> top_table <- readHTMLTable(parsed_top)[[1]] 
R> head(top_table[1:10, 1:3]) 


    Rank & Title                                             IMDb Rating
1   1. The Shawshank Redemption (1994)                       9.2
2   2. The Godfather (1972)                                  9.2
3   3. The Godfather: Part II (1974)                         9.0
4   4. The Dark Knight (2008)                                8.9
5   5. Pulp Fiction (1994)                                   8.9
6   6. The Good, the Bad and the Ugly (1966)                 8.9
7   7. Schindler’s List (1993)                               8.9
8   8. 12 Angry Men (1957)                                   8.9
9   9. The Lord of the Rings: The Return of the King (2003)  8.9
10  10. Fight Club (1999)                                    8.8
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Chapter 83: Creating reports with 
RMarkdown 


Section 83.1: Including bibliographies 


A BibTeX catalogue can easily be included with the YAML option bibliography:. A certain style for the bibliography 
can be added with biblio-style:. The references are added at the end of the document. 


---
title: "Including Bibliography" 
author: "John Doe" 
output: pdf_document 
bibliography: references.bib 
---

# Abstract 

@R_Core_Team_2016 

# References 


[Figure: rendered PDF titled "Including Bibliography", with an Abstract section and a References section] 

Section 83.2: Including LaTeX Preamble Commands 


There are two possible ways of including LaTeX preamble commands (e.g. \usepackage) in an RMarkdown 
document. 


1. Using the YAML option header-includes: 


---
title: "Including LaTeX Preamble Commands in RMarkdown" 
header-includes: 
   - \renewcommand{\familydefault}{cmss} 
   - \usepackage[cm, slantedGreek]{sfmath} 
   - \usepackage[T1]{fontenc} 
output: pdf_document 
---

```{r setup, include=FALSE} 
knitr::opts_chunk$set(echo = TRUE, external=T) 
```

# Section 1 

As you can see, this text uses the Computer Modern Font! 


[Figure: rendered PDF titled "Including LaTeX Preamble Commands in RMarkdown", showing Section 1 set in the Computer Modern sans-serif font] 


2. Including External Commands with includes, in_header 


---
title: "Including LaTeX Preamble Commands in RMarkdown" 
output: 
  pdf_document: 
    includes: 
      in_header: includes.tex 
---

```{r setup, include=FALSE} 
knitr::opts_chunk$set(echo = TRUE, external=T) 
```

# Section 1 

As you can see, this text uses the Computer Modern Font! 


Here, the contents of includes.tex are the same three commands we included with header-includes. 


Writing a whole new template 


A possible third option is to write your own LaTeX template and include it with template. But this covers a lot more 
of the structure than only the preamble. 


---
title: "My Template" 
author: "Martin Schmelzer" 
output: 
  pdf_document: 
    template: myTemplate.tex 
---

Section 83.3: Printing tables 


There are several packages that allow the output of data structures in the form of HTML or LaTeX tables. They mostly 
differ in flexibility. 
Here I use the packages: 


• knitr 
• xtable 
• pander 


For HTML documents 


---
title: "Printing Tables" 
author: "Martin Schmelzer" 
date: "29 Juli 2016" 
output: html_document 
---

```{r setup, include=FALSE} 
knitr::opts_chunk$set(echo = TRUE) 
library(knitr) 
library(xtable) 
library(pander) 
df <- mtcars[1:4,1:4] 
```

# Print tables using `kable` 
```{r, 'kable'} 
kable(df) 
```

# Print tables using `xtable` 
```{r, 'xtable', results='asis'} 
print(xtable(df), type="html") 
```

# Print tables using `pander` 
```{r, 'pander'} 
pander(df) 
```


[Figure: rendered HTML page titled "Printing Tables", showing the same table printed with kable, xtable and pander] 


For PDF documents 


---
title: "Printing Tables" 
author: "Martin Schmelzer" 
date: "29 Juli 2016" 
output: pdf_document 
---

```{r setup, include=FALSE} 
knitr::opts_chunk$set(echo = TRUE) 
library(knitr) 
library(xtable) 
library(pander) 
df <- mtcars[1:4,1:4] 
```

# Print tables using `kable` 
```{r, 'kable'} 
kable(df) 
```

# Print tables using `xtable` 
```{r, 'xtable', results='asis'} 
print(xtable(df, caption="My Table")) 
```

# Print tables using `pander` 
```{r, 'pander'} 
pander(df) 
```


[Figure: rendered PDF titled "Printing Tables", showing the same table printed with kable, xtable and pander] 


How can I stop xtable printing the comment ahead of each table? 


options(xtable.comment = FALSE) 


Section 83.4: Basic R-markdown document structure 


R-markdown code chunks 


R-markdown is a markdown file with embedded blocks of R code called chunks. There are two types of R code 
chunks: inline and block. 


Inline chunks are added using the following syntax: 


`r 2*2` 


They are evaluated and their output is inserted in place. 


Block chunks have a different syntax: 


```{r name, echo=TRUE, include=TRUE, ...} 
2*2 
```


And they come with several possible options. Here are the main ones (but there are many others): 


• echo (boolean) controls whether the code inside the chunk will be included in the document 
• include (boolean) controls whether the output should be included in the document 
• fig.width (numeric) sets the width of the output figures 
• fig.height (numeric) sets the height of the output figures 
• fig.cap (character) sets the figure captions 


They are written in a simple tag=value format like in the example above. 
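

Instead of repeating the same options in every chunk header, defaults can be set once (typically in a setup chunk) with knitr's opts_chunk object. The particular values below are arbitrary examples.

```r
library(knitr)

# Defaults applied to every later chunk unless its header overrides them
opts_chunk$set(echo = FALSE, fig.width = 6, fig.height = 4)

# Current defaults can be inspected with get()
opts_chunk$get("echo")
opts_chunk$get("fig.width")
```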

R-markdown document example 

Below is a basic example of an R-markdown file illustrating the way R code chunks are embedded inside R-markdown. 

# Title # 

This is **plain markdown** text. 

```{r code, include=FALSE, echo=FALSE} 
# Just declare variables 
income <- 1000 
taxes <- 125 
```

My income is: `r income` dollars and I paid `r taxes` dollars in taxes. 
Below is the sum of money I will have left: 

```{r gain, include=TRUE, echo=FALSE} 
gain <- income-taxes 
gain 
```

```{r plotOutput, include=TRUE, echo=FALSE, fig.width=6, fig.height=6} 
pie(c(income, taxes), label=c("income", "taxes")) 
```


Converting R-markdown to other formats 


The R knitr package can be used to evaluate the R chunks inside an R-markdown file and turn it into a regular markdown 
file. 


The following steps are needed in order to turn an R-markdown file into pdf/html: 
1. Convert the R-markdown file to a markdown file using knitr. 
2. Convert the obtained markdown file to pdf/html using specialized tools like pandoc. 


In addition to the above, the knitr package has the wrapper functions knit2html() and knit2pdf() that can be used to 
produce the final document without the intermediate step of manually converting it to the markdown format: 


If the above example file was saved as income.Rmd it can be converted to a pdf file using the following R commands: 


library(knitr) 
knit2pdf("income.Rmd", "income.pdf") 
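

A commonly used alternative is the render function from the rmarkdown package, which runs knitr and pandoc in one step. This sketch assumes that rmarkdown and pandoc are installed (plus a LaTeX distribution for PDF output) and is skipped when the file or the package is missing.

```r
rmd_file <- "income.Rmd"  # file name from the example above

if (file.exists(rmd_file) && requireNamespace("rmarkdown", quietly = TRUE)) {
  # Each call knits the document and converts the result with pandoc
  rmarkdown::render(rmd_file, output_format = "pdf_document")
  rmarkdown::render(rmd_file, output_format = "html_document")
}
```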


The final document will be similar to the one below. 


Title 


This is plain markdown text. 
My income is: 1000 dollars and I paid 125 dollars in taxes. 


Below is the sum of money I will have left: 


## [1] 875 


[Figure: pie chart with two slices labelled "income" and "taxes"] 
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Chapter 84: GPU-accelerated computing 
Section 84.1: gpuR gpuMatrix objects 


library(gpuR) 

# gpuMatrix objects 
X <- gpuMatrix(rnorm(100), 10, 10) 
Y <- gpuMatrix(rnorm(100), 10, 10) 

# transfer data to GPU when operation called 
# automatically copied back to CPU 
Z <- X %*% Y 


Section 84.2: gpuR vclMatrix objects 


library(gpuR) 

# vclMatrix objects 
X <- vclMatrix(rnorm(100), 10, 10) 
Y <- vclMatrix(rnorm(100), 10, 10) 

# data always on GPU 
# no data transfer 
Z <- X %*% Y 
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Chapter 85: heatmap and heatmap.2 


Section 85.1: Examples from the official documentation 


stats::heatmap 
Example 1 (Basic usage) 


require(graphics); require(grDevices) 
x <- as.matrix(mtcars) 
rc <- rainbow(nrow(x), start = 0, end = .3) 
cc <- rainbow(ncol(x), start = 0, end = .3) 
hv <- heatmap(x, col = cm.colors(256), scale = "column", 
              RowSideColors = rc, ColSideColors = cc, margins = c(5,10), 
              xlab = "specification variables", ylab = "Car Models", 
              main = "heatmap(<Mtcars data>, ..., scale = \"column\")") 


[Figure: heatmap titled heatmap(<Mtcars data>, ..., scale = "column"), with row and column dendrograms and side color bars; x-axis: specification variables, y-axis: Car Models]


utils::str(hv) # the two re-ordering index vectors 

# List of 4
#  $ rowInd: int [1:32] ...
#  $ colInd: int [1:11] ...
#  $ Rowv  : NULL
#  $ Colv  : NULL
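
The two index vectors returned by heatmap() can be reused to put the data matrix into the same order as the plot. The following is a minimal sketch, assuming the returned list has the rowInd and colInd components shown above; the name x_ordered is made up for this illustration, and pdf(NULL) is only used to discard the plot itself.

```r
x <- as.matrix(mtcars)

pdf(NULL)                        # discard graphics output for this sketch
hv <- heatmap(x, scale = "column")
invisible(dev.off())

# Reorder rows and columns of the original matrix using the returned indices
x_ordered <- x[hv$rowInd, hv$colInd]

dim(x_ordered)                   # same dimensions as x
head(rownames(x_ordered), 3)     # first car models in reordered row order
```

Only the order changes; no values are rescaled, because the scale argument affects the colors in the plot and not the returned indices.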




Example 2 (no column dendrogram (nor reordering) at all) 

heatmap(x, Colv = NA, col = cm.colors(256), scale = "column",
        RowSideColors = rc, margins = c(5,10),
        xlab = "specification variables", ylab = "Car Models",
        main = "heatmap(<Mtcars data>, ..., scale = \"column\")")

[Figure: heatmap titled heatmap(<Mtcars data>, ..., scale = "column"), with row dendrogram only and columns in their original order]

Example 3 ("no nothing") 


heatmap(x, Rowv = NA, Colv = NA, scale = "column", 
main = "heatmap(*, NA, NA) ~= image(t(x))") 


[Figure: heatmap titled heatmap(*, NA, NA) ~= image(t(x)), without dendrograms or reordering; rows are car models, columns are the mtcars variables]


Example 4 (with reorder()) 




round(Ca <- cor(attitude), 2) 


#            rating complaints privileges learning raises critical advance
# rating       1.00       0.83       0.43     0.62   0.59     0.16    0.16
# complaints   0.83       1.00       0.56     0.60   0.67     0.19    0.22
# privileges   0.43       0.56       1.00     0.49   0.45     0.15    0.34
# learning     0.62       0.60       0.49     1.00   0.64     0.12    0.53
# raises       0.59       0.67       0.45     0.64   1.00     0.38    0.57
# critical     0.16       0.19       0.15     0.12   0.38     1.00    0.28
# advance      0.16       0.22       0.34     0.53   0.57     0.28    1.00


symnum(Ca) # simple graphic 


#            rt cm p l rs cr a
# rating     1
# complaints +  1
# privileges .  .  1
# learning   ,  .  . 1
# raises     .  ,  . , 1
# critical             .  1
# advance          . . .     1
# attr(,"legend")
# [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

heatmap(Ca, symm = TRUE, margins = c(6,6)) 

[Figure: symmetric heatmap of Ca with dendrograms; variables reordered as critical, advance, privileges, rating, complaints, learning, raises]

Example 5 (NO reorder())


heatmap(Ca, Rowv = FALSE, symm = TRUE, margins = c(6,6))


[Figure: symmetric heatmap of Ca without reordering]

Example 6 (slightly artificial with color bar, without ordering)


cc <- rainbow(nrow(Ca))
heatmap(Ca, Rowv = FALSE, symm = TRUE, RowSideColors = cc, ColSideColors = cc,
        margins = c(6,6))


[Figure: symmetric heatmap of Ca with row and column color bars, without reordering]


Example 7 (slightly artificial with color bar, with ordering) 


heatmap(Ca, symm = TRUE, RowSideColors = cc, ColSideColors = cc, 
margins = c(6,6)) 

[Figure: symmetric heatmap of Ca with row and column color bars; variables reordered as critical, advance, privileges, rating, complaints, learning, raises]


Example 8 (For variable clustering, rather use distance based on cor()) 


symnum( cU <- cor(USJudgeRatings) )
# (lower-triangular symbol matrix for CONT, INTG, DMNR, DILG, CFMG, DECI, PREP, FAMI, ORAL, WRIT, PHYS, RTEN)
# attr(,"legend")
# [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1


hU <- heatmap(cU, Rowv = FALSE, symm = TRUE, col = topo.colors(16),
              distfun = function(c) as.dist(1 - c), keep.dendro = TRUE)

[Figure: symmetric heatmap of cU in topo.colors; variables ordered CONT, INTG, DMNR, PHYS, DILG, CFMG, DECI, RTEN, ORAL, WRIT, PREP, FAMI]

## The Correlation matrix with same reordering: 
round(100 * cU[hU[[1]], hU[[2]]])
#      CONT INTG DMNR PHYS DILG CFMG DECI RTEN ORAL WRIT PREP FAMI
# CONT  100  -13  -15    5    1   14    9   -3   -1   -4    1   -3
# INTG  -13  100   96   74   87   81   80   94   91   91   88   87
# DMNR  -15   96  100   79   84   81   80   94   91   89   86   84
# PHYS    5   74   79  100   81   88   87   91   89   86   85   84
# DILG    1   87   84   81  100   96   96   93   95   96   98   96
# CFMG   14   81   81   88   96  100   98   93   95   94   96   94
# DECI    9   80   80   87   96   98  100   92   95   95   96   94
# RTEN   -3   94   94   91   93   93   92  100   98   97   95   94
# ORAL   -1   91   91   89   95   95   95   98  100   99   98   98
# WRIT   -4   91   89   86   96   94   95   97   99  100   99   99
# PREP    1   88   86   85   98   96   96   95   98   99  100   99
# FAMI   -3   87   84   84   96   94   94   94   98   99   99  100


## The column dendrogram: 
utils::str(hU$Colv)
# --[dendrogram w/ 2 branches and 12 members at h = 1.15]
#   |--leaf "CONT"
#   `--[dendrogram w/ 2 branches and 11 members at h = 0.258]
#      |--[dendrogram w/ 2 branches and 2 members at h = 0.0354]
#      |  |--leaf "INTG"
#      |  `--leaf "DMNR"
#      `--[dendrogram w/ 2 branches and 9 members at h = 0.187]
#         |--leaf "PHYS"
#         `--[dendrogram w/ 2 branches and 8 members at h = 0.075]
#            |--[dendrogram w/ 2 branches and 3 members at h = 0.0438]
#            |  |--leaf "DILG"
#            |  `--[dendrogram w/ 2 branches and 2 members at h = 0.0189]
#            |     |--leaf "CFMG"
#            |     `--leaf "DECI"
#            `--[dendrogram w/ 2 branches and 5 members at h = 0.0584]
#               |--leaf "RTEN"
#               `--[dendrogram w/ 2 branches and 4 members at h = 0.0187]
#                  |--[dendrogram w/ 2 branches and 2 members at h = 0.00657]
#                  |  |--leaf "ORAL"
#                  |  `--leaf "WRIT"
#                  `--[dendrogram w/ 2 branches and 2 members at h = 0.0101]
#                     |--leaf "PREP"
#                     `--leaf "FAMI"
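
The distfun used in Example 8 turns correlations into dissimilarities, so that strongly correlated variables end up close together. The following sketch performs the same step by hand with hclust(), which heatmap() uses as its default clustering function; the object names d and hc are arbitrary. The leaf order printed here need not match the plot exactly, since heatmap() may additionally reorder the dendrogram.

```r
cU <- cor(USJudgeRatings)

# Correlation-based dissimilarity: 0 for perfectly correlated variables
d <- as.dist(1 - cU)

# Hierarchical clustering on that dissimilarity
hc <- hclust(d)

# Variable names in dendrogram (leaf) order
hc$labels[hc$order]
```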


Section 85.2: Tuning parameters in heatmap.2 


Given: 
x <- as.matrix(mtcars) 


One can use heatmap.2 - a more recent optimized version of heatmap, by loading the following library: 


require(gplots) 
heatmap.2(x)


[Figure: default heatmap.2 output with color key and histogram; the car names along the rows are cut off]


To add a title, x- or y-label to your heatmap, you need to set the main, xlab and ylab parameters:


heatmap.2(x, main = "My main title: Overview of car features", xlab = "Car features",
          ylab = "Car brands")


If you wish to define your own color palette for your heatmap, you can set the col parameter by using the 
colorRampPalette function: 

heatmap.2(x, trace="none", key=TRUE, Colv=FALSE, dendrogram = "row", col =
colorRampPalette(c("darkblue", "white", "darkred"))(100))


[Figure: heatmap.2 output with the custom palette, row dendrogram and color key; the car names are cut off]


As you can notice, the labels on the y axis (the car names) don't fit in the figure. In order to fix this, the user can
tune the margins parameter:


heatmap.2(x, trace="none", key=TRUE, col = colorRampPalette(c("darkblue", "white", "darkred"))(100),
margins=c(5,8)) 
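
The col argument in these calls is built in two steps that are easy to miss: colorRampPalette() returns a function, and calling that function with a number yields that many interpolated colors. A small sketch using only base R (grDevices); the names make_palette and pal are made up for this illustration:

```r
# colorRampPalette() returns a function that generates n interpolated colors
make_palette <- colorRampPalette(c("darkblue", "white", "darkred"))
pal <- make_palette(100)

length(pal)        # number of colors generated
pal[c(1, 100)]     # the two end colors, as hex strings
```

The resulting character vector can be passed as col to heatmap() or heatmap.2() alike.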

[Figure: heatmap.2 output with margins = c(5,8); the full car names are now visible]


Further, we can change the dimensions of each section of our heatmap (the key histogram, the dendrograms and
the heatmap itself), by tuning lhei and lwid:


[Figure: heatmap.2 output with resized sections and a compact color key]


If we only want to show a row (or column) dendrogram, we need to set Colv=FALSE (or Rowv=FALSE) and adjust the
dendrogram parameter:


heatmap.2(x, trace="none", key=TRUE, Colv=FALSE, dendrogram = "row", col =
colorRampPalette(c("darkblue", "white", "darkred"))(100), margins=c(5,8), lwid = c(5,15), lhei =
c(3,15))

[Figure: heatmap.2 output with row dendrogram only]


For changing the font size of the legend title, labels and axis, the user needs to set cex.main, cex.lab, cex.axis in
the par list:


par(cex.main=1, cex.lab=0.7, cex.axis=0.7)
heatmap.2(x, trace="none", key=TRUE, Colv=FALSE, dendrogram = "row", col =
colorRampPalette(c("darkblue", "white", "darkred"))(100), margins=c(5,8), lwid = c(5,15), lhei =
c(5,15))


[Figure: heatmap.2 output with adjusted font sizes in the color key]
Chapter 86: Network analysis with the
igraph package 

Section 86.1: Simple Directed and Non-directed Network 
Graphing 


The igraph package for R is a wonderful tool that can be used to model networks, both real and virtual, with 
simplicity. This example is meant to demonstrate how to create two simple network graphs using the igraph 
package within R v.3.2.3. 


Non-Directed Network 
The network is created with this piece of code: 


library(igraph)

g <- graph.formula(Node1-Node2, Node1-Node3, Node4-Node1)
plot(g)


[Figure: plot of the non-directed network, with Node1 connected to Node2, Node3 and Node4]


Directed Network 


dg<-graph.formula(Tom-+Mary, Tom-+Bill, Tom-+Sam, Sue+-Mary, Bill-+Sue) 
plot(dg) 


This code will then generate a network with arrows: 

[Figure: plot of the directed network, with arrows showing the direction of each edge]


Code example of how to make a double sided arrow: 


dg<-graph.formula(Tom-+Mary, Tom-+Bill, Tom-+Sam, Sue+-Mary, Bill++Sue) 
plot(dg) 


[Figure: plot of the directed network with a double-sided arrow between Bill and Sue]
Chapter 87: Functional programming


Section 87.1: Built-in Higher Order Functions 
R has a set of built in higher order functions: Map, Reduce, Filter, Find, Position, Negate. 
Map applies a given function to a list of values: 


words <- list("this", "is", "an", "example") 
Map(toupper, words) 


Reduce successively applies a binary function to a list of values in a recursive fashion.


Reduce(`*`, 1:10)
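
To make the "successively" part visible, Reduce can keep every intermediate result with accumulate = TRUE, and it accepts an explicit starting value through init. A short sketch with base R only; the variable names are arbitrary:

```r
# Running product: 1, 1*2, 1*2*3, ...
running <- Reduce(`*`, 1:5, accumulate = TRUE)
running
# [1]   1   2   6  24 120

# Start the fold from 100 instead of the first element
total <- Reduce(`+`, 1:4, init = 100)
total
# [1] 110
```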


Filter, given a predicate function and a list of values, returns a filtered list containing only the values for which
the predicate function is TRUE.


Filter(is.character, list(1,"a",2,"b",3,"c"))


Find, given a predicate function and a list of values, returns the first value for which the predicate function is TRUE.


Find(is.character, list(1,"a",2,"b",3,"c"))


Position, given a predicate function and a list of values, returns the position of the first value in the list for which
the predicate function is TRUE.


Position(is.character, list(1,"a",2,"b",3,"c"))


Negate inverts a predicate function, making it return FALSE for values where it returned TRUE and vice versa.


is.noncharacter <- Negate(is.character) 
is.noncharacter("a") 
is.noncharacter (mean) 
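

These functions compose naturally. The following sketch chains Negate, Filter, Map and Reduce on the same mixed list used above; the intermediate names (values, numbers, squares, total) are only for illustration:

```r
values <- list(1, "a", 2, "b", 3, "c")

numbers <- Filter(Negate(is.character), values)   # keep the non-character items
squares <- Map(function(v) v^2, numbers)          # square each remaining item
total   <- Reduce(`+`, squares)                   # 1 + 4 + 9

total
# [1] 14
```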
Chapter 88: Get user input
Section 88.1: User input in R 


Sometimes it can be interesting to have a cross-talk between the user and the program, one example being the 
swirl package that had been designed to teach R in R. 


One can ask for user input using the readline command: 
name <- readline(prompt = "What is your name?") 


The user can then give any answer, such as a number, a character string or a vector, so the result should be checked
to make sure that the user has given a proper answer. For example:


result <- readline(prompt = "What is the result of 1+1?")
while (result != 2) {
    result <- readline(prompt = "Wrong answer. What is the result of 1+1?")
}


Note that the new answer has to be assigned back to result inside the loop, otherwise the code is stuck in a
never-ending loop. Also keep in mind that user input is saved as a character: the comparison above relies on R
coercing 2 to "2", so an answer such as "2.0" would be rejected.


To compare numbers, we have to coerce the input using as.numeric:


result <- as.numeric(readline(prompt = "What is the result of 1+1?"))
while (result != 2) {
    result <- as.numeric(readline(prompt = "Wrong answer. What is the result of 1+1?"))
}
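

One caveat with as.numeric: it returns NA for input that is not a number, and a while condition that evaluates to NA stops with an error. The following is a sketch of a more defensive loop. ask_number() is a hypothetical helper written for this illustration (it is not part of base R), and its read argument exists only so that the function can be tried without a keyboard.

```r
# Hypothetical helper: keep asking until the reply parses as the expected
# number. `read` defaults to readline(), but any function that takes a
# prompt and returns a string can be supplied instead.
ask_number <- function(question, expected, read = readline, max_tries = 5) {
  prompt <- paste0(question, " ")
  for (i in seq_len(max_tries)) {
    answer <- suppressWarnings(as.numeric(read(prompt)))
    # isTRUE() treats NA (non-numeric input) as a wrong answer instead of failing
    if (isTRUE(answer == expected)) return(answer)
    prompt <- paste0("Wrong answer. ", question, " ")
  }
  NA_real_  # give up after max_tries attempts
}

# Non-interactive demonstration with canned replies instead of a keyboard
replies <- c("two", "3", "2")
fake_read <- local({
  i <- 0
  function(prompt) {
    i <<- i + 1
    replies[i]
  }
})
ask_number("What is the result of 1+1?", 2, read = fake_read)
# [1] 2
```

In an interactive session the helper would simply be called as ask_number("What is the result of 1+1?", 2).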
Chapter 89: Spark API (SparkR)


Section 89.1: Setup Spark context 


Setup Spark context in R 


To start working with Spark's distributed dataframes, you must connect your R program with an existing Spark
cluster.


library(SparkR) 
sc <- sparkR.init() # connection to Spark context 
sqlContext <- sparkRSQL.init(sc) # connection to SQL context 


Here is information on how to connect your IDE to a Spark cluster.
Get Spark Cluster 


There is an Apache Spark introduction topic with install instructions. Basically, you can employ a Spark Cluster 
locally via java (See instructions) or use (non-free) cloud applications (e.g. Microsoft Azure [topic site], IBM). 


Section 89.2: Cache data 


What: 


Caching can optimize computation in Spark. Caching stores data in memory and is a special case of persistence. 
Here is an explanation of what happens when you cache an RDD in Spark.


Why: 


Basically, caching saves an interim partial result - usually after transformations - of your original data. So, when you 
use the cached RDD, the already transformed data from memory is accessed without recomputing the earlier 
transformations. 


How: 


Here is an example of how to quickly access large data (here a 3 GB csv file) from in-memory storage when accessing
it more than once:


library(SparkR)

# next line is needed for direct csv import:
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.4.0" "sparkr-shell"')

sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# loading 3 GB big csv file:
train <- read.df(sqlContext, "/train.csv", source = "com.databricks.spark.csv", inferSchema = "true")
cache(train)

system.time(head(train))
# output: time elapsed: 125 s. This action invokes the caching at this point.
system.time(head(train))
# output: time elapsed: 0.2 s (!!)


Section 89.3: Create RDDs (Resilient Distributed Datasets)


From dataframe: 


mtrdd <- createDataFrame(sqlContext, mtcars) 


From csv: 


For csv's, you need to add the csv package to the environment before initiating the Spark context: 


# context for csv import
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.4.0" "sparkr-shell"')
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)


Then, you can load the csv either by inferring the data schema of the data in the columns:


train <- read.df(sqlContext, "/train.csv", header = "true", source = "com.databricks.spark.csv",
                 inferSchema = "true")


Or by specifying the data schema beforehand:


customSchema <- structType(
    structField("margin", "integer"),
    structField("gross", "integer"),
    structField("name", "string"))
train <- read.df(sqlContext, "/train.csv", header = "true", source = "com.databricks.spark.csv",
                 schema = customSchema)


GoalKicker.com - R Notes for Professionals 




Chapter 90: Meta: Documentation 
Guidelines 


Section 90.1: Style 


Prompts 


If you want your code to be copy-pastable, remove prompts such as R>, >, or + at the beginning of each new line.
Some Docs authors prefer to not make copy-pasting easy, and that is okay. 


Console output 


Console output should be clearly distinguished from code. Common approaches include: 


• Include prompts on input (as seen when using the console).
• Comment out all output, with # or ## starting each line.
• Print as-is, trusting the leading [1] to make the output stand out from the input.
• Add a blank line between code and console output.


Assignment 


Both = and <- are fine for assigning R objects. Use white space appropriately to avoid writing code that is difficult to
parse, such as x<-1 (ambiguous between x <- 1 and x < -1).
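

The ambiguity exists only for the human reader: the R parser always treats <- as a single assignment token. A quick way to check how a piece of code is parsed, using base R only, is to quote it and deparse the result:

```r
# The parser reads `x<-1` as an assignment, not a comparison
deparse(quote(x<-1))
# [1] "x <- 1"

# With a space after `<`, the same characters become a comparison
deparse(quote(x< -1))
# [1] "x < -1"
```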


Code comments 


Be sure to explain the purpose and function of the code itself. There isn't any hard-and-fast rule on whether this 
explanation should be in prose or in code comments. Prose may be more readable and allows for longer 
explanations, but code comments make for easier copy-pasting. Keep both options in mind. 


Sections 


Many examples are short enough to not need sections, but if you use them, start with H1. 


Section 90.2: Making good examples 


Most of the guidance for creating good examples for Q&A carries over into the documentation. 


• Make it minimal and get to the point. Complications and digressions are counterproductive.
• Include both working code and prose explaining it. Neither one is sufficient on its own.
• Don't rely on external sources for data. Generate data or use the datasets library if possible:


library(help = "datasets" ) 
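

As an illustration of the last point, here is a sketch of a self-contained example data set. The data frame and its column names are invented for this illustration; set.seed() makes the generated numbers the same for every reader.

```r
set.seed(42)   # fix the random seed so readers get the same numbers

# Small made-up data set; the column names are purely illustrative
df <- data.frame(
  id    = 1:6,
  group = rep(c("a", "b"), each = 3),
  value = round(rnorm(6), 2)
)
str(df)

# Built-in data works just as well and needs no setup
head(iris, 3)
```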


There are some additional considerations in the context of Docs: 


• Refer to built-in docs like ?data.frame whenever relevant. The SO Docs are not an attempt to replace the
built-in docs. It is important to make sure new R users know that the built-in docs exist as well as how to find
them.
• Move content that applies to multiple examples to the Remarks section.


Chapter 91: Input and output


Section 91.1: Reading and writing data frames 

Data frames are R's tabular data structure. They can be written to or read from in a variety of ways. 
This example illustrates a couple common situations. See the links at the end for other resources. 
Writing 


Before making the example data below, make sure you're in a folder you want to write to. Run getwd() to verify the folder 
you're in and read ?setwd if you need to change folders. 


set.seed(1)
for (i in 1:3)
    write.table(
        data.frame(id = 1:2, v = sample(letters, 2)),
        file = sprintf("file201%s.csv", i)
    )


Now, we have three similarly-formatted CSV files on disk. 
Reading 


We have three similarly-formatted files (from the last section) to read in. Since these files are related, we should 
store them together after reading in, in a list: 


file_names = c("file2011.csv", "file2012.csv", "file2013.csv")
file_contents = lapply(setNames(file_names, file_names), read.table)

# $file2011.csv
#   id v
# ...
#
# $file2012.csv
#   id v
# ...
#
# $file2013.csv
#   id v
# ...


To work with this list of files, first examine the structure with str(file_contents), then read about stacking the list 
with ?rbind or iterating over the list with ?lapply. 
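

Both follow-up steps can be sketched in a few lines. So that the sketch runs on its own, it uses a stand-in list with the same shape as file_contents (the letter values are made up); tagged and stacked are arbitrary names.

```r
# Stand-in for the list read from disk above (same shape: id and v columns)
file_contents <- list(
  file2011.csv = data.frame(id = 1:2, v = c("a", "b")),
  file2012.csv = data.frame(id = 1:2, v = c("c", "d")),
  file2013.csv = data.frame(id = 1:2, v = c("e", "f"))
)

# Iterate: add a column recording which file each row came from
tagged <- lapply(names(file_contents), function(nm) {
  cbind(file = nm, file_contents[[nm]])
})

# Stack: bind all data frames into a single one
stacked <- do.call(rbind, tagged)
str(stacked)
```

Keeping the file name as a column means no information is lost when the list is collapsed into one table.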


Further resources 
Check out ?read.table and ?write.table to extend this example. Also:


• R binary formats (for tables and other objects)
• Plain-text table formats
    ◦ comma-delimited CSVs
    ◦ tab-delimited TSVs
    ◦ Fixed-width formats
• Language-agnostic binary table formats
    ◦ Feather
• Foreign table and spreadsheet formats
    ◦ SAS
    ◦ SPSS
    ◦ Stata
    ◦ Excel
• Relational database table formats
    ◦ MySQL
    ◦ SQLite
    ◦ PostgreSQL


Chapter 92: I/O for foreign tables (Excel,
SAS, SPSS, Stata) 


Section 92.1: Importing data with rio 


A very simple way to import data from many common file formats is with rio. This package provides a function
import() that wraps many commonly used data import functions, thereby providing a standard interface. It works
simply by passing a file name or URL to import():


import("example.csv")       # comma-separated values
import("example.tsv")       # tab-separated values
import("example.dta")       # Stata
import("example.sav")       # SPSS
import("example.sas7bdat") # SAS 
import("example.xlsx") # Excel 


import() can also read from compressed directories, URLs (HTTP or HTTPS), and the clipboard. A comprehensive 
list of all supported file formats is available on the rio package github repository. 


It is even possible to specify some further parameters related to the specific file format you are trying to read, 
passing them directly within the import() function: 


import("example.csv", format = ",") #for csv file where comma is used as separator 


import("example.csv", format = ";") #for csv file where semicolon is used as separator 


Section 92.2: Read and write Stata, SPSS and SAS files 


The packages foreign and haven can be used to import and export files from a variety of other statistical packages 
like Stata, SPSS and SAS and related software. There is a read function for each of the supported data types to 
import the files. 


# loading the packages 
library(foreign) 
library(haven) 
library(readstata13) 
library(Hmisc) 


Some examples for the most common data types: 


# reading Stata files with `foreign`
read.dta("path/to/your/data")

# reading Stata files with `haven`
read_dta("path/to/your/data")


The foreign package can read in Stata (.dta) files for versions of Stata 7-12. According to the development page,
read.dta is more or less frozen and will not be updated for reading in versions 13+. For more recent versions of
Stata, you can use either the readstata13 package or haven. With readstata13, the files are read like this:


# reading recent Stata (13+) files with `readstata13`
read.dta13("path/to/your/data")


For reading in SPSS and SAS files:


# reading SPSS files with `foreign`
read.spss("path/to/your/data.sav", to.data.frame = TRUE)
# reading SPSS files with `haven`
read_spss("path/to/your/data.sav")
read_sav("path/to/your/data.sav")
read_por("path/to/your/data.por")

# reading SAS files with `foreign`
read.ssd("path/to/your/data")
# reading SAS files with `haven`
read_sas("path/to/your/data")
# reading native SAS files with `Hmisc`
sas.get("path/to/your/data")   # requires access to saslib
# reading SAS XPORT format (*.XPT) files
sasxport.get("path/to/your/data.xpt")  # does not require access to SAS executable


The SAScii package provides functions that will accept SAS SET import code and construct a text file that can be
processed with read.fwf. It has proved very robust for import of large public-released datasets. Support is at
https://github.com/ajdamico/SAScii


To export data frames to other statistical packages you can use the write function write.foreign(). This will write
2 files, one containing the data and one containing instructions the other package needs to read the data.


# writing to Stata, SPSS or SAS files with `foreign`
write.foreign(dataframe, datafile, codefile,
              package = c("SPSS", "Stata", "SAS"), ...)
write.foreign(dataframe, "path/to/data/file", "path/to/instruction/file", package = "Stata")


# writing to Stata files with `foreign`
write.dta(dataframe, "file", version = 7L,
          convert.dates = TRUE, tz = "GMT",
          convert.factors = c("labels", "string", "numeric", "codes"))

# writing to Stata files with `haven`
write_dta(dataframe, "path/to/your/data")

# writing to Stata files with `readstata13`
save.dta13(dataframe, file, data.label = NULL, time.stamp = TRUE,
           convert.factors = TRUE, convert.dates = TRUE, tz = "GMT",
           add.rownames = FALSE, compress = FALSE, version = 117,
           convert.underscore = FALSE)

# writing to SPSS files with `haven`
write_sav(dataframe, "path/to/your/data")


Files stored by SPSS can also be read with read.spss in this way:


foreign::read.spss('data.sav', to.data.frame=TRUE, use.value.labels=FALSE,
                   use.missings=TRUE, reencode='UTF-8')

# to.data.frame: if TRUE, return a data frame
# use.value.labels: if TRUE, convert variables with value labels into R factors with those levels
# use.missings: if TRUE, information on user-defined missing values will be used to set the
#               corresponding values to NA
# reencode: character strings will be re-encoded to the current locale. The default, NA, means to do
#           so in a UTF-8 locale, only.


Section 92.3: Importing Excel files 
There are several R packages to read Excel files, each of which uses different languages or resources, as
summarized in the following table:


R package   Uses
xlsx        Java
XLConnect   Java
openxlsx    C++
readxl      C++
RODBC       ODBC
gdata       Perl


For the packages that use Java or ODBC it is important to know details about your system because you may have
compatibility issues depending on your R version and OS. For instance, if you are using 64-bit R then you also must
have 64-bit Java to use xlsx or XLConnect.


Some examples of reading Excel files with each package are provided below. Note that many of the packages have
the same or very similar function names. Therefore, it is useful to state the package explicitly, like
package::function. The package openxlsx requires prior installation of RTools.


Reading Excel files with the xlsx package


library(xlsx)


The index or name of the sheet is required to import.


xlsx::read.xlsx("Book1.xlsx", sheetIndex=1)

xlsx::read.xlsx("Book1.xlsx", sheetName="Sheet1")


Reading Excel files with the XLConnect package


library(XLConnect)
wb <- XLConnect::loadWorkbook("Book1.xlsx")


# Either, if Book1.xlsx has a sheet called "Sheet1": 

sheet1 <- XLConnect::readWorksheet(wb, "Sheet1") 

# Or, more generally, just get the first sheet in Book1.xlsx: 
sheet1 <- XLConnect::readWorksheet(wb, getSheets(wb)[1]) 


XLConnect automatically imports the pre-defined Excel cell-styles embedded in Book1.xlsx. This is useful when you
wish to format your workbook object and export a perfectly formatted Excel document. Firstly, you will need to
create the desired cell formats in Book1.xlsx and save them, for example, as myHeader, myBody and myPcts. Then,
after loading the workbook in R (see above):


Headerstyle <- XLConnect::getCellStyle(wb, "myHeader")
Bodystyle <- XLConnect::getCellStyle(wb, "myBody")
Pctsstyle <- XLConnect::getCellStyle(wb, "myPcts")


The cell styles are now saved in your R environment. In order to assign the cell styles to certain ranges of your data, 
you need to define the range and then assign the style: 


Headerrange <- expand.grid(row = 1, col = 1:8) 
Bodyrange <- expand.grid(row = 2:6, col = c(1:5, 8)) 
Pctrange <- expand.grid(row = 2:6, col = c(6, 7)) 


XLConnect::setCellStyle(wb, sheet = "sheet1", row = Headerrange$row,
                        col = Headerrange$col, cellstyle = Headerstyle)
XLConnect::setCellStyle(wb, sheet = "sheet1", row = Bodyrange$row,
                        col = Bodyrange$col, cellstyle = Bodystyle)
XLConnect::setCellStyle(wb, sheet = "sheet1", row = Pctrange$row,
                        col = Pctrange$col, cellstyle = Pctsstyle)


Note that XLConnect is easy, but can become extremely slow in formatting. A much faster, but more cumbersome
formatting option is offered by openxlsx.


Reading Excel files with the openxlsx package


Excel files can be imported with the package openxlsx:


library(openxlsx)
openxlsx::read.xlsx("spreadsheet1.xlsx", colNames=TRUE, rowNames=TRUE)

# colNames: If TRUE, the first row of data will be used as column names.
# rowNames: If TRUE, the first column of data will be used as row names.


The sheet which should be read into R can be selected either by providing its position in the sheet argument:


openxlsx::read.xlsx("spreadsheet1.xlsx", sheet = 1)


or by declaring its name:


openxlsx::read.xlsx("spreadsheet1.xlsx", sheet = "Sheet1")


Additionally, openxlsx can detect date columns in a read sheet. In order to allow automatic detection of dates, the
argument detectDates should be set to TRUE:


openxlsx::read.xlsx("spreadsheet1.xlsx", sheet = "Sheet1", detectDates = TRUE)


Reading excel files with the readxl package 

Excel files can be imported as a data frame into R using the readxl package.


library(readxl)


It can read both .xls and .xlsx files.


readxl::read_excel("spreadsheet1.xls")
readxl::read_excel("spreadsheet2.xlsx")


The sheet to be imported can be specified by number or name.


readxl::read_excel("spreadsheet.xls", sheet = 1)
readxl::read_excel("spreadsheet.xls", sheet = "summary")


The argument col_names = TRUE sets the first row as the column names.


readxl::read_excel("spreadsheet.xls", sheet = 1, col_names = TRUE)


The argument col_types can be used to specify the column types in the data as a vector.


readxl::read_excel("spreadsheet.xls", sheet = 1, col_names = TRUE,
                   col_types = c("text", "date", "numeric", "numeric"))


Reading excel files with the RODBC package 

Excel files can be read using the ODBC Excel Driver that interfaces with Windows' Access Database Engine (ACE),
formerly JET. With the RODBC package, R can connect to this driver and directly query workbooks. Worksheets are 
assumed to maintain column headers in first row with data in organized columns of similar types. NOTE: This 
approach is limited to only Windows/PC machines as JET/ACE are installed .dll files and not available on other 
operating systems. 


library(RODBC)

xlconn <- odbcDriverConnect('Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};
                             DBQ=C:\\Path\\To\\Workbook.xlsx')

df <- sqlQuery(xlconn, "SELECT * FROM [SheetName$]")
close(xlconn)


Because this approach connects through an SQL engine, Excel worksheets can be queried similarly to database tables,
including JOIN and UNION operations. Syntax follows the JET/ACE SQL dialect. NOTE: Only data access DML
statements, specifically SELECT, can be run on workbooks; they are treated as non-updateable queries.


joindf <- sqlQuery(xlconn, "SELECT t1.*, t2.* FROM [Sheet1$] t1
                            INNER JOIN [Sheet2$] t2
                            ON t1.[ID] = t2.[ID]")

uniondf <- sqlQuery(xlconn, "SELECT * FROM [Sheet1$]
                             UNION
                             SELECT * FROM [Sheet2$]")


Even other workbooks can be queried from the same ODBC channel pointing to a current workbook:


otherwkbkdf <- sqlQuery(xlconn, "SELECT * FROM
                                 [Excel 12.0 Xml;HDR=Yes;
                                 Database=C:\\Path\\To\\Other\\Workbook.xlsx].[Sheet1$];")


Reading excel files with the gdata package 


example here 


Section 92.4: Import or Export of Feather file 


Feather is an implementation of Apache Arrow designed to store data frames in a language agnostic manner while 
maintaining metadata (e.g. date classes), increasing interoperability between Python and R. Reading a feather file 
will produce a tibble, not a standard data.frame. 


library(feather) 


path <- "filename.feather" 
df <- mtcars 


write_feather(df, path) 


df2 <- read_feather(path)

head(df2)
## A tibble: 6 x 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
## 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
## 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
## 4  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
## 5  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2
## 6  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1

head(df)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1


The current documentation contains this warning: 


Note to users: Feather should be treated as alpha software. In particular, the file format is likely to evolve
over the coming year. Do not use Feather for long-term data storage.
Chapter 93: I/O for database tables
Section 93.1: Reading Data from MySQL Databases 


General 


Using the package RMySQL we can easily query MySQL as well as MariaDB databases and store the result in an R
dataframe:


library(RMySQL)
mydb <- dbConnect(MySQL(), user='user', password='password', dbname='dbname', host='127.0.0.1')
queryString <- "SELECT * FROM table1 t1 JOIN table2 t2 on t1.id=t2.id" 


query <- dbSendQuery(mydb, queryString) 
data <- fetch(query, n=-1) # n=-1 to return all results 


Using limits 


It is also possible to limit the result, e.g. to the first 100,000 rows. To do so, add a LIMIT clause to the SQL
query; the limit is applied by the database server, so only those rows are returned to R. Example:


queryString <- "SELECT * FROM table1 limit 100000" 


Section 93.2: Reading Data from MongoDB Databases 
In order to load data from a MongoDB database into an R dataframe, use the library mongolite:


# Use the mongolite library:
# install.packages("mongolite")
library(jsonlite)
library(mongolite)

# Connect to the database and the desired collection as root:
db <- mongo(collection = "Tweets", db = "TweetCollector",
            url = "mongodb://USERNAME:PASSWORD@HOSTNAME")

# Read the desired documents i.e. Tweets inside one dataframe:
documents <- db$find(limit = 100000, skip = 0, fields = '{ "_id" : false, "Text" : true }')


The code connects to the server HOSTNAME as USERNAME with PASSWORD, tries to open the database TweetCollector 
and read the collection Tweets. The query tries to read the field i.e. column Text. 


The result is a dataframe whose columns are the fields returned by the query. In this example, the dataframe
contains the column Text, e.g. documents$Text.
Chapter 94: I/O for geographic data
(shapefiles, etc.)


See also Introduction to Geographical Maps and Input and Output 


Section 94.1: Import and Export Shapefiles 


With the rgdal package it is possible to import and export shapefiles with R. The function readOGR can be used to
import shapefiles. If you want to import a file from e.g. ArcGIS, the first argument dsn is the path to the folder which
contains the shapefile. layer is the name of the shapefile without the file ending (just map and not map.shp).


library(rgdal)
# Forward slashes avoid R's backslash escapes in paths
readOGR(dsn = "path/to/the/folder/containing/the/shapefile", layer = "map")


To export a shapefile use the writeOGR function. The first argument is the spatial object produced in R. dsn and
layer are the same as above. The obligatory fourth argument is the driver used to generate the shapefile. The function
ogrDrivers() lists all available drivers. If you want to export a shapefile to ArcGIS or QGIS you could use driver =
"ESRI Shapefile".


writeOGR(Rmap, dsn = "path/to/the/folder/containing/the/shapefile", layer = "map",
         driver = "ESRI Shapefile")


The tmap package has a very convenient function read_shape(), which is a wrapper for rgdal::readOGR(). The
read_shape() function simplifies the process of importing a shapefile a lot. On the downside, tmap is quite heavy.
Chapter 95: I/O for raster images


See also Raster and Image Analysis and Input and Output 


Section 95.1: Load a multilayer raster 


The R-Logo is a multilayer raster file (red, green, blue) 


library(raster) 
r <- stack("C:/Program Files/R/R-3.2.3/doc/html/logo.jpg")
plot(r) 


[Figure: one panel per raster layer (logo.1, logo.2, ...)]

The individual layers of the RasterStack object can be addressed by [[.


plot(r[[1]]) 
Chapter 96: I/O for R's binary format


Section 96.1: Rds and RData (Rda) files


.rds and .Rdata (also known as .rda) files can be used to store R objects in a format native to R. There are multiple
advantages of saving this way when contrasted with non-native storage approaches, e.g. write.table:


• It is faster to restore the data to R
• It keeps R specific information encoded in the data (e.g., attributes, variable types, etc).


saveRDS/readRDS only handle a single R object. However, they are more flexible than the multi-object storage
approach in that the object name of the restored object need not be the same as the object name when the object
was stored.


Using an .rds file, for example to save the iris dataset, we would use:

saveRDS(object = iris, file = "my_data_frame.rds")

To load the data back in:

iris2 <- readRDS(file = "my_data_frame.rds")

To save multiple objects we can use save() and output as .Rdata.
For example, to save 2 dataframes, iris and cars:

save(iris, cars, file = "myIrisAndCarsData.Rdata")

To load:


load("myIrisAndCarsData.Rdata")
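The difference between the two approaches can be sketched with a short round trip. This is only an illustration: it writes to temporary files, and the object names flowers, my_iris and my_cars are made up for the example.

```r
# Round-trip sketch using temporary files (file names are placeholders).
rds_file   <- tempfile(fileext = ".rds")
rdata_file <- tempfile(fileext = ".RData")

# saveRDS/readRDS: a single object, restored under any name we like.
saveRDS(iris, file = rds_file)
flowers <- readRDS(rds_file)
identical(flowers, iris)

# save/load: several objects, restored under their original names.
my_iris <- iris
my_cars <- cars
save(my_iris, my_cars, file = rdata_file)
rm(my_iris, my_cars)
restored <- load(rdata_file)   # load() invisibly returns the restored names
restored
```

Note that load() silently overwrites any existing objects with the same names, which is one reason to prefer readRDS() for single objects.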


Section 96.2: Environments


The functions save and load allow us to specify the environment where the object will be hosted: 


save(iris, cars, file = "myIrisAndCarsData.Rdata", envir = foo <- new.env())
load("myIrisAndCarsData.Rdata", envir = foo)

foo$cars
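A common reason to pass envir to load() is to avoid overwriting objects in the workspace. The following is a minimal sketch of that idea; the names sandbox and x and the temporary file are assumptions made for the example.

```r
tmp <- tempfile(fileext = ".RData")
x <- 1:3
save(x, file = tmp)

x <- "changed"                 # the workspace copy now differs from the saved one
sandbox <- new.env()
load(tmp, envir = sandbox)     # restore into the sandbox, not the workspace

ls(sandbox)                    # names of the objects that were restored
sandbox$x                      # the saved value
x                              # the workspace value is untouched
```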
Chapter 97: Recycling


Section 97.1: Recycling use in subsetting 


Recycling can be used in a clever way to simplify code. 


Subsetting 


If we want to keep every other element of a vector (the 1st, 3rd, 5th, ...) we can do the following:


my_vec <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
my_vec[c(TRUE, FALSE)]


[1] 1 3 5 7 9

Here the logical vector was recycled to the length of my_vec.

We can also perform comparisons using recycling:


my_vec <- c("foo", "bar", "soap", "mix")
my_vec == "bar"


[1] FALSE TRUE FALSE FALSE


Here "bar" gets recycled. 
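The same idea extends to other strides and to arithmetic. A small sketch (the variable names are illustrative only):

```r
my_vec <- 1:10

# A length-3 logical mask is recycled, so positions 1, 4, 7 and 10 are kept.
every_third <- my_vec[c(TRUE, FALSE, FALSE)]
every_third

# Arithmetic recycles the shorter operand as well:
# 1:6 is paired with 10, 20, 10, 20, 10, 20.
sums <- 1:6 + c(10, 20)
sums
```

When the longer length is not a multiple of the shorter one, arithmetic still recycles but R issues a warning, which is usually a sign of a mistake.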
Chapter 98: Expression: parse + eval


Section 98.1: Execute code in string format 
In this example, we want to execute code which is stored in a string format.


# the string
str <- "1+1"


# A string is not an expression.
is.expression(str)
[1] FALSE


eval(str)
[1] "1+1"


# parse converts a string into an expression
parsed.str <- parse(text="1+1")


is.expression(parsed.str)
[1] TRUE


eval(parsed.str)
[1] 2
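A string may hold more than one statement. The sketch below (the string and the environment name env are made up for illustration) shows that parse() then returns one expression per top-level statement, and that eval() can be pointed at a separate environment so the workspace is not modified.

```r
code  <- "x <- 2; y <- x^3; x + y"
exprs <- parse(text = code)
length(exprs)                        # one element per top-level statement

env    <- new.env()
result <- eval(exprs, envir = env)   # evaluates each in turn, returns the last value
result
ls(env)                              # x and y were created in env only
```

Evaluating strings from untrusted sources is risky, since any R code they contain will be executed.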
Chapter 99: Regular Expression Syntax in R


This document introduces the basics of regular expressions as used in R. For more information about R's regular 
expression syntax, see ?regex. For a comprehensive list of regular expression operators, see this ICU guide on 
regular expressions. 


Section 99.1: Use grep to find a string in a character vector


# General syntax: 
# grep(<pattern>, <character vector>) 


mystring <- c('The number 5',
              'The number 8',
              '1 is the loneliest number',
              'Company, 3 is',
              'Git SSH tag is git@github.com',
              'My personal site is www.personal.org',
              'path/to/my/file')


grep('5', mystring)
# [1] 1

grep('@', mystring)
# [1] 5

grep('number', mystring)
# [1] 1 2 3


x|y means look for "x" or "y" 


grep('5|8', mystring)
# [1] 1 2

grep('com|org', mystring)
# [1] 5 6


. is a special character in Regex. It means "match any character" 


grep('The number .', mystring)
# [1] 1 2


Be careful when trying to match dots! 


tricky <- c('www.personal.org', 'My friend is a cyborg')
grep('.org', tricky)
# [1] 1 2


To match a literal character, you have to escape the string with a backslash (\). However, R tries to look for escape 
characters when creating strings, so you actually need to escape the backslash itself (i.e. you need to double escape 
regular expression characters.) 


grep('\.org', tricky)
# Error: '\.' is an unrecognized escape in character string starting "'\."

grep('\\.org', tricky)
# [1] 1
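Two alternatives avoid the double backslash altogether. This is a small sketch using the same tricky vector: fixed = TRUE treats the whole pattern as a literal string, and a dot inside brackets is always literal.

```r
tricky <- c('www.personal.org', 'My friend is a cyborg')

literal_match <- grep('.org', tricky, fixed = TRUE)  # no regex interpretation at all
bracket_match <- grep('[.]org', tricky)              # dot inside [] is a plain dot

literal_match
bracket_match
```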


If you want to match one of several characters, you can wrap those characters in brackets ([ ]) 


grep('[13]', mystring)
# [1] 3 4

grep('[@/]', mystring)
# [1] 5 7


It may be useful to indicate character sequences. E.g. [0-4] will match 0, 1, 2, 3, or 4, [A-Z] will match any
uppercase letter, [A-Za-z] will match any uppercase or lowercase letter, and [A-Za-z0-9] will match any letter or
number (i.e. all alphanumeric characters). Note that [A-z] is not a safe shorthand for [A-Za-z]: in ASCII that range
also covers the punctuation characters that sit between Z and a.


grep('[0-4]', mystring)
# [1] 3 4

grep('[A-Z]', mystring)
# [1] 1 2 4 5 6


R also has several shortcut classes that can be used in brackets. For instance, [:lower:] is short for a-z, [:upper:]
is short for A-Z, [:alpha:] is A-Za-z, [:digit:] is 0-9, and [:alnum:] is A-Za-z0-9. Note that these whole expressions
must be used inside brackets; for instance, to match a single digit, you can use [[:digit:]] (note the double
brackets). As another example, [@[:digit:]/] will match the characters @, / or 0-9.


grep('[[:digit:]]', mystring)
# [1] 1 2 3 4

grep('[@[:digit:]/]', mystring)
# [1] 1 2 3 4 5 7


Brackets can also be used to negate a match with a caret (^). For instance, [^5] will match any character other than
"5".


grep('The number [^5]', mystring)
# [1] 2
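grep() returns positions by default. As a brief sketch on a shortened copy of the vector, value = TRUE returns the matching strings instead, and grepl() returns one logical per element, which is convenient for subsetting.

```r
mystring <- c('The number 5', 'The number 8', '1 is the loneliest number')

matched_values <- grep('The number [^5]', mystring, value = TRUE)
matched_flags  <- grepl('The number [^5]', mystring)

matched_values
matched_flags
```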
Chapter 100: Regular Expressions (regex)


Regular expressions (also called "regex" or "regexp") define patterns that can be matched against a string. Type 
?regex for the official R documentation and see the Regex Docs for more details. The most important ‘gotcha’ that 
will not be learned in the SO regex/topics is that most R-regex functions need the use of paired backslashes to 
escape in a pattern parameter. 


Section 100.1: Differences between Perl and POSIX regex 


There are two ever-so-slightly different engines of regular expressions implemented in R. The default is
POSIX-consistent; all regex functions in R also accept an option to switch to the Perl-compatible engine: perl = TRUE.


Look-ahead/look-behind

perl = TRUE enables look-ahead and look-behind in regular expressions.


• "(?<=A)B" matches an appearance of the letter B only if it's preceded by A, i.e. "ABACADABRA" would be
  matched, but "abacadabra" and "aBacadabra" would not.
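The look-behind pattern from the bullet can be tried directly. A minimal sketch (the vector name words is illustrative):

```r
words <- c("ABACADABRA", "abacadabra", "aBacadabra")

# Look-behind needs the Perl-compatible engine.
has_match <- grepl("(?<=A)B", words, perl = TRUE)
has_match

# Only the B is replaced; the preceding A is checked but not consumed.
replaced <- sub("(?<=A)B", "_", words, perl = TRUE)
replaced
```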


Section 100.2: Validate a date in a "YYYYMMDD" format


It is a common practice to name files using the date as prefix in the following format: YYYYMMDD, for example:
20170101_results.csv. A date in such string format can be verified using the following regular expression:


\\d{4}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])

The above expression considers dates from year: 0000-9999, months between: 01-12 and days 01-31.
For example:


> grepl("\\d{4}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])", "20170101")
[1] TRUE
> grepl("\\d{4}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])", "20171206")
[1] TRUE
> grepl("\\d{4}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])", "29991231")
[1] TRUE


Note: It validates the date syntax, but we can have a wrong date with a valid syntax, for example: 20170229 (2017
is not a leap year).


> grepl("\\d{4}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])", "20170229")
[1] TRUE


If you want to validate a date, it can be done via this user defined function:

is.Date <- function(x) {return(!is.na(as.Date(as.character(x), format = '%Y%m%d')))}

Then


> is.Date(c("20170229", "20170101", 20170101))
[1] FALSE TRUE TRUE
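The two checks complement each other. The regular expression above is not anchored, so it also accepts strings that merely contain a date, while as.Date() rejects impossible calendar dates. The sketch below combines an anchored version of the pattern with the calendar check; the names date_pattern and candidates are assumptions made for the example.

```r
# Anchored with ^ and $ so the whole string must be a date.
date_pattern <- "^\\d{4}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])$"
candidates   <- c("20170101", "20170229", "file_20170101")

syntax_ok <- grepl(date_pattern, candidates)
real_date <- !is.na(as.Date(candidates, format = "%Y%m%d"))

# A value is accepted only if it passes both checks.
valid <- syntax_ok & real_date
data.frame(candidates, syntax_ok, real_date, valid)
```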
Section 100.3: Escaping characters in R regex patterns


Since both R and regex share the escape character, "\", building correct patterns for grep, sub, gsub or any other
function that accepts a pattern argument will often need pairing of backslashes. Suppose you build a three-item
character vector in which one item has a linefeed, another a tab character and the third neither, and the goal is to
turn either the linefeed or the tab into 4 spaces. A single backslash is needed for the construction, but paired
backslashes for matching:


x <- c("a\nb", "c\td", "e    f")
x # how it's stored
# [1] "a\nb" "c\td" "e    f"
cat(x) # how it will be seen with cat
#a
#b c	d e    f


gsub(patt="\\n|\\t", repl="    ", x)
#[1] "a    b" "c    d" "e    f"


Note that the pattern argument (whose name is optional if it appears first, and which only needs partial spelling) is
the only argument to require this doubling or pairing. The replacement argument does not require the doubling of
characters needing to be escaped. If you wanted all the linefeeds and 4-space occurrences replaced with tabs it
would be:


gsub("\\n|    ", "\t", x)
#[1] "a\tb" "c\td" "e\tf"

Section 100.4: Validate US States postal abbreviations 


The following regex includes 50 states and also Commonwealth/Territory (see www.50states.com):


us.states.pattern <-
"(A[LKSZR])|(C[AOT])|(D[EC])|(F[ML])|(G[AU])|(HI)|(I[DLNA])|(K[SY])|(LA)|(M[EHDAINSOT])|(N[EVHJMYCD])|(MP)|(O[HKR])|(P[WAR])|(RI)|(S[CD])|(T[NX])|(UT)|(V[TIA])|(W[AVIY])"


For example:

> test <- c("AL", "AZ", "AR", "AJ", "AS", "DC", "FM", "GU", "PW", "FL", "AJ", "AP")


> grepl(us.states.pattern, test)
[1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE


Note: 


If you want to verify only the 50 States, then we recommend to use the R-dataset: state.abb from state, for 
example: 


> data(state) 
> test %in% state.abb 
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE 


We get TRUE only for 50-States abbreviations: AL, AZ, AR, FL. 
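If a regular expression is still preferred for the 50 states, one option is to build it from state.abb and anchor it, so that longer strings which merely contain an abbreviation are rejected. This is a sketch; the names state_pattern and codes are made up for the example.

```r
# Build "^(AL|AK|...|WY)$" from the built-in vector of state abbreviations.
state_pattern <- paste0("^(", paste(state.abb, collapse = "|"), ")$")

codes   <- c("AL", "FL", "ALX", "al", "DC")
matches <- grepl(state_pattern, codes)
matches
```

Without the ^ and $ anchors, "ALX" would be accepted because it contains "AL".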


Section 100.5: Validate US phone numbers 


The following regular expression: 
us.phones.regex <- "^\\s*(\\+\\s*1(-?|\\s+))*[0-9]{3}\\s*-?\\s*[0-9]{3}\\s*-?\\s*[0-9]{4}$"


Validates a phone number in the form of: +1-xxx-xxx-xxxx, including optional leading/trailing blanks at the
beginning/end of each group of numbers, but not in the middle, for example: +1-xxx-xxx-xx xx is not valid. The -
delimiter can be replaced by blanks: xxx xxx xxxx or omitted: xxxxxxxxxx. The +1 prefix is optional.


Let's check it: 


us.phones.regex <- "^\\s*(\\+\\s*1(-?|\\s+))*[0-9]{3}\\s*-?\\s*[0-9]{3}\\s*-?\\s*[0-9]{4}$"


phones.OK <- c("305-123-4567", "305 123 4567", "+1-786-123-4567",
               "+1 786 123 4567", "7861234567", "786 - 123 4567", "+ 1 786 - 123 4567")


phones.NOK <- c("124-456-78901", "124-456-789",
                "124-456-78 90",
                "124-45 6-789", "12 4-456-7890")


Valid cases: 


> grepl(us.phones.regex, phones.OK)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE 


> 
Invalid cases: 


> grepl(us.phones.regex, phones.NOK) 
[1] FALSE FALSE FALSE FALSE FALSE 


> 
Note: 


• \\s matches any space, tab or newline character
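Once a number has passed validation, it is often useful to bring it into one canonical form. The following sketch strips everything but digits, keeps the last ten, and reinserts the delimiters; the variable names are illustrative, and it assumes the inputs have already been validated.

```r
phones <- c("305-123-4567", "+1 786 123 4567", " 786 - 123 4567")

# Drop every character that is not a digit.
digits <- gsub("[^0-9]", "", phones)

# Keep the last 10 digits, which discards an optional leading country code 1.
last10 <- substr(digits, nchar(digits) - 9, nchar(digits))

# Reinsert the delimiters using capture groups.
canonical <- sub("(\\d{3})(\\d{3})(\\d{4})", "\\1-\\2-\\3", last10)
canonical
```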
Chapter 101: Combinatorics


Section 101.1: Enumerating combinations of a specified length 


Without replacement 


With combn, each vector appears in a column: 


combn(LETTERS, 3) 


# Showing only first 10.
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "A"  "A"  "A"  "A"  "A"  "A"  "A"  "A"  "A"  "A"
[2,] "B"  "B"  "B"  "B"  "B"  "B"  "B"  "B"  "B"  "B"
[3,] "C"  "D"  "E"  "F"  "G"  "H"  "I"  "J"  "K"  "L"


With replacement 
With expand.grid, each vector appears in a row:


expand.grid(LETTERS, LETTERS, LETTERS)
# or
do.call(expand.grid, rep(list(LETTERS), 3))


# Showing only first 10.
   Var1 Var2 Var3
1     A    A    A
2     B    A    A
3     C    A    A
4     D    A    A
5     E    A    A
6     F    A    A
7     G    A    A
8     H    A    A
9     I    A    A
10    J    A    A


For the special case of pairs, outer can be used, putting each vector into a cell: 


# FUN here is used as a function executed on each resulting pair.
# in this case it's string concatenation.
outer(LETTERS, LETTERS, FUN=paste0)


# Showing only first 10 rows and columns
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,] "AA" "AB" "AC" "AD" "AE" "AF" "AG" "AH" "AI" "AJ"
 [2,] "BA" "BB" "BC" "BD" "BE" "BF" "BG" "BH" "BI" "BJ"
 [3,] "CA" "CB" "CC" "CD" "CE" "CF" "CG" "CH" "CI" "CJ"
 [4,] "DA" "DB" "DC" "DD" "DE" "DF" "DG" "DH" "DI" "DJ"
 [5,] "EA" "EB" "EC" "ED" "EE" "EF" "EG" "EH" "EI" "EJ"
 [6,] "FA" "FB" "FC" "FD" "FE" "FF" "FG" "FH" "FI" "FJ"
 [7,] "GA" "GB" "GC" "GD" "GE" "GF" "GG" "GH" "GI" "GJ"
 [8,] "HA" "HB" "HC" "HD" "HE" "HF" "HG" "HH" "HI" "HJ"
 [9,] "IA" "IB" "IC" "ID" "IE" "IF" "IG" "IH" "II" "IJ"
[10,] "JA" "JB" "JC" "JD" "JE" "JF" "JG" "JH" "JI" "JJ"
Section 101.2: Counting combinations of a specified length


Without replacement 


choose(length(LETTERS), 5) 
[1] 65780 


With replacement 


length(letters)^5
[1] 11881376 
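The counts can be cross-checked against the enumerations from the previous section. The sketch below uses a length of 3 to keep the objects small. One caveat, stated here as a general combinatorial fact rather than something from the text above: n^k counts ordered selections with replacement (the rows produced by expand.grid), whereas the number of unordered selections with replacement is choose(n + k - 1, k).

```r
n <- length(LETTERS)
k <- 3

# Unordered, without replacement: columns of combn().
n_combn <- ncol(combn(LETTERS, k))
n_combn == choose(n, k)

# Ordered, with replacement: rows of expand.grid().
n_grid <- nrow(expand.grid(LETTERS, LETTERS, LETTERS))
n_grid == n^k

# Unordered, with replacement (multisets of size k).
n_multiset <- choose(n + k - 1, k)
n_multiset
```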
Chapter 102: Solving ODEs in R


Parameter  Details
y          (named) numeric vector: the initial (state) values for the ODE system
times      time sequence for which output is wanted; the first value of times must be the initial time
func       name of the function that computes the values of the derivatives in the ODE system
parms      (named) numeric vector: parameters passed to func
method     the integrator to use, by default: lsoda
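To make the roles of y, times, func and parms concrete, here is a base R sketch of a fixed-step classical Runge-Kutta integrator that accepts the same kind of derivative function. It is a stand-in written for illustration only: rk4_sketch and decay are hypothetical names, and the adaptive lsoda integrator that deSolve uses by default works differently.

```r
# Hypothetical helper: fixed-step classical Runge-Kutta (RK4).
# func(t, y, parms) must return a list whose first element is dy/dt.
rk4_sketch <- function(y, times, func, parms) {
  out <- matrix(NA_real_, nrow = length(times), ncol = length(y) + 1,
                dimnames = list(NULL, c("time", names(y))))
  out[1, ] <- c(times[1], y)
  for (i in seq_along(times)[-1]) {
    t0 <- times[i - 1]
    h  <- times[i] - t0
    k1 <- func(t0,         y,              parms)[[1]]
    k2 <- func(t0 + h / 2, y + h / 2 * k1, parms)[[1]]
    k3 <- func(t0 + h / 2, y + h / 2 * k2, parms)[[1]]
    k4 <- func(t0 + h,     y + h * k3,     parms)[[1]]
    y  <- y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    out[i, ] <- c(times[i], y)
  }
  out
}

# Exponential decay dN/dt = -rate * N, written in the same style as the models below.
decay <- function(t, y, parms) {
  with(as.list(c(y, parms)), list(c(-rate * N)))
}

out <- rk4_sketch(y = c(N = 1), times = seq(0, 5, by = 0.1),
                  func = decay, parms = c(rate = 0.5))

# Largest deviation from the exact solution exp(-rate * t).
max_err <- max(abs(out[, "N"] - exp(-0.5 * out[, "time"])))
max_err
```

The named state vector and the with(as.list(c(y, parms)), ...) idiom are what allow the model equations to be written with plain variable names.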


Section 102.1: The Lorenz model 


The Lorenz model describes the dynamics of three state variables, X, Y and Z. The model equations are:


\[
\begin{aligned}
\frac{dX}{dt} &= aX + YZ \\
\frac{dY}{dt} &= b(Y - Z) \\
\frac{dZ}{dt} &= -XY + cY - Z
\end{aligned}
\]
library(deSolve)

## -----------------------------------
## Define R-function
## -----------------------------------

Lorenz <- function (t, y, parms) {
  with(as.list(c(y, parms)), {
    dX <- a * X + Y * Z
    dY <- b * (Y - Z)
    dZ <- -X * Y + c * Y - Z

    return(list(c(dX, dY, dZ)))
  })
}

## Define parameters and variables
parms <- c(a = -8/3, b = -10, c = 28)
yini <- c(X = 1, Y = 1, Z = 1)
times <- seq(from = 0, to = 100, by = 0.01)

## Solve the ODEs
out <- ode(y = yini, times = times, func = Lorenz, parms = parms)

## Plot the results
plot(out, lwd = 2)
plot(out[,"X"], out[,"Y"],
     type = "l", xlab = "X",
     ylab = "Y", main = "butterfly")



Section 102.2: Lotka-Volterra or: Prey vs. predator 


library(deSolve)

## -----------------------------------
## Define R-function
## -----------------------------------

LV <- function(t, y, parms) {
  with(as.list(c(y, parms)), {

    dP <- rG * P * (1 - P/K) - rI * P * C
    dC <- rI * P * C * AE - rM * C

    return(list(c(dP, dC), sum = C+P))
  })
}

## -----------------------------------
## Define parameters and variables
## -----------------------------------
parms <- c(rI = 0.2, rG = 1.0, rM = 0.2, AE = 0.5, K = 10)
yini <- c(P = 1, C = 2)
times <- seq(from = 0, to = 200, by = 1)


## Solve the ODEs
out <- ode(y = yini, times = times, func = LV, parms = parms)

## Plot the results
matplot(out[ ,1], out[ ,2:4], type = "l", xlab = "time", ylab = "Conc",
        main = "Lotka-Volterra", lwd = 2)
legend("topright", c("prey", "predator", "sum"), col = 1:3, lty = 1:3)


[Figure: "Lotka-Volterra" plot of prey, predator and sum (y-axis: Conc) against time (0 to 200)]
Section 102.3: ODEs in compiled languages - definition in R


library(deSolve)

## -----------------------------------
## Define parameters and variables
## -----------------------------------

# NOTE: eps and M are used below but their definitions are missing from this text.
# The two values here are assumptions (those of the standard car axis test problem).
eps <- 0.01
M <- 10
k <- M * eps^2/2
L <- 1
L0 <- 0.5
r <- 0.1
w <- 10
g <- 1

parameter <- c(eps = eps, M = M, k = k, L = L, L0 = L0, r = r, w = w, g = g)
yini <- c(xl = 0, yl = L0, xr = L, yr = L0,
          ul = -L0/L, vl = 0,
          ur = -L0/L, vr = 0,
          lam1 = 0, lam2 = 0)

## -----------------------------------
## Define R-function
## -----------------------------------

caraxis_R <- function(t, y, parms) {
  with(as.list(c(y, parms)), {

    yb <- r * sin(w * t)
    xb <- sqrt(L * L - yb * yb)
    Ll <- sqrt(xl^2 + yl^2)
    Lr <- sqrt((xr - xb)^2 + (yr - yb)^2)

    dxl <- ul; dyl <- vl; dxr <- ur; dyr <- vr

    dul <- (L0-Ll) * xl/Ll + 2 * lam2 * (xl-xr) + lam1*xb
    dvl <- (L0-Ll) * yl/Ll + 2 * lam2 * (yl-yr) + lam1*yb - k * g

    dur <- (L0-Lr) * (xr-xb)/Lr - 2 * lam2 * (xl-xr)
    dvr <- (L0-Lr) * (yr-yb)/Lr - 2 * lam2 * (yl-yr) - k * g

    c1 <- xb * xl + yb * yl
    c2 <- (xl - xr)^2 + (yl - yr)^2 - L * L

    return(list(c(dxl, dyl, dxr, dyr, dul, dvl, dur, dvr, c1, c2)))
  })
}


Section 102.4: ODEs in compiled languages - definition in C 


sink("caraxis_C.c")
cat("
/* suitable names for parameters and state variables */

#include <R.h>
#include <math.h>
static double parms[8];

#define eps parms[0]
#define m   parms[1]
#define k   parms[2]
#define L   parms[3]
#define L0  parms[4]
#define r   parms[5]
#define w   parms[6]
#define g   parms[7]

/* initialise the parameter block */
void init_C(void (* daeparms)(int *, double *)) {
  int N = 8;
  daeparms(&N, parms);
}

/* Compartments */
#define xl   y[0]
#define yl   y[1]
#define xr   y[2]
#define yr   y[3]
#define lam1 y[8]
#define lam2 y[9]

/* the derivative function */
void caraxis_C (int *neq, double *t, double *y, double *ydot,
                double *yout, int* ip)
{
  double yb, xb, Lr, Ll;

  yb = r * sin(w * *t) ;
  xb = sqrt(L * L - yb * yb);
  Ll = sqrt(xl * xl + yl * yl);
  Lr = sqrt((xr-xb)*(xr-xb) + (yr-yb)*(yr-yb));

  ydot[0] = y[4];
  ydot[1] = y[5];
  ydot[2] = y[6];
  ydot[3] = y[7];

  ydot[4] = (L0-Ll) * xl/Ll + lam1*xb + 2*lam2*(xl-xr) ;
  ydot[5] = (L0-Ll) * yl/Ll + lam1*yb + 2*lam2*(yl-yr) - k*g;
  ydot[6] = (L0-Lr) * (xr-xb)/Lr - 2*lam2*(xl-xr) ;
  ydot[7] = (L0-Lr) * (yr-yb)/Lr - 2*lam2*(yl-yr) - k*g ;

  ydot[8] = xb * xl + yb * yl;
  ydot[9] = (xl-xr) * (xl-xr) + (yl-yr) * (yl-yr) - L*L;
}
", fill = TRUE)
sink()

system("R CMD SHLIB caraxis_C.c")

dyn.load(paste("caraxis_C", .Platform$dynlib.ext, sep = ""))
dllname_C <- dyn.load(paste("caraxis_C", .Platform$dynlib.ext, sep = ""))[[1]]


Section 102.5: ODEs in compiled languages - definition in fortran


sink("caraxis_fortran.f")
cat("
      subroutine init_fortran(daeparms)

      external daeparms
      integer, parameter :: N = 8
      double precision parms(N)
      common /myparms/parms

      call daeparms(N, parms)
      return
      end

      subroutine caraxis_fortran(neq, t, y, ydot, out, ip)
      implicit none
      integer neq, IP(*)
      double precision t, y(neq), ydot(neq), out(*)
      double precision eps, M, k, L, L0, r, w, g
      common /myparms/ eps, M, k, L, L0, r, w, g

      double precision xl, yl, xr, yr, ul, vl, ur, vr, lam1, lam2
      double precision yb, xb, Ll, Lr, dxl, dyl, dxr, dyr
      double precision dul, dvl, dur, dvr, c1, c2

c expand state variables
      xl = y(1)
      yl = y(2)
      xr = y(3)
      yr = y(4)
      ul = y(5)
      vl = y(6)
      ur = y(7)
      vr = y(8)
      lam1 = y(9)
      lam2 = y(10)

      yb = r * sin(w * t)
      xb = sqrt(L * L - yb * yb)
      Ll = sqrt(xl**2 + yl**2)
      Lr = sqrt((xr - xb)**2 + (yr - yb)**2)

      dxl = ul
      dyl = vl
      dxr = ur
      dyr = vr

      dul = (L0-Ll) * xl/Ll + 2 * lam2 * (xl-xr) + lam1*xb
      dvl = (L0-Ll) * yl/Ll + 2 * lam2 * (yl-yr) + lam1*yb - k*g
      dur = (L0-Lr) * (xr-xb)/Lr - 2 * lam2 * (xl-xr)
      dvr = (L0-Lr) * (yr-yb)/Lr - 2 * lam2 * (yl-yr) - k*g

      c1 = xb * xl + yb * yl
      c2 = (xl - xr)**2 + (yl - yr)**2 - L * L

c function values in ydot
      ydot(1) = dxl
      ydot(2) = dyl
      ydot(3) = dxr
      ydot(4) = dyr
      ydot(5) = dul
      ydot(6) = dvl
      ydot(7) = dur
      ydot(8) = dvr
      ydot(9) = c1
      ydot(10) = c2
      return
      end
", fill = TRUE)
sink()
system("R CMD SHLIB caraxis_fortran.f")
dyn.load(paste("caraxis_fortran", .Platform$dynlib.ext, sep = ""))
dllname_fortran <- dyn.load(paste("caraxis_fortran", .Platform$dynlib.ext, sep = ""))[[1]]


Section 102.6: ODEs in compiled languages - a benchmark test 


Once you have compiled and loaded the code from the three preceding examples (ODEs in compiled languages -
definition in R, ODEs in compiled languages - definition in C and ODEs in compiled languages - definition in fortran)
you can run a benchmark test.


library(microbenchmark)

R <- function() {
  out <- ode(y = yini, times = times, func = caraxis_R,
             parms = parameter)
}

C <- function() {
  out <- ode(y = yini, times = times, func = "caraxis_C",
             initfunc = "init_C", parms = parameter,
             dllname = dllname_C)
}

fortran <- function() {
  out <- ode(y = yini, times = times, func = "caraxis_fortran",
             initfunc = "init_fortran", parms = parameter,
             dllname = dllname_fortran)
}


Check if results are equal: 


all.equal(tail(R()), tail(fortran())) 
all.equal(R()[,2], fortran()[,2]) 
all.equal(R()[,2], C()[,2]) 


Make a benchmark (Note: On your machine the times are, of course, different):


bench <- microbenchmark::microbenchmark(
  R(),
  fortran(),
  C(),
  times = 1000
)

summary(bench)
      expr       min        lq       mean     median         uq        max neval cld
       R() 31508.928 33651.541 36747.8733 36062.2475 37546.8025 132996.564  1000   b
 fortran()   570.674   596.700   686.1084   637.4605   730.1775   4256.555  1000  a
       C()   562.163   590.377   673.6124   625.0700   723.8468   5914.347  1000  a

[Figure: benchmark timings per expression, x-axis "Time [microseconds]" on a log scale from 1e+03 to 1e+05]


Clearly, the definition in R is slow compared with the definitions in C and fortran. For big models it is worth
translating the problem into a compiled language. The package cOde is one way to translate ODEs from R to C.
Chapter 103: Feature Selection in R -- Removing Extraneous Features


Section 103.1: Removing features with zero or near-zero variance

A feature that has near zero variance is a good candidate for removal.


You can manually detect numerical variance below your own threshold: 


data("GermanCredit", package = "caret")   # the data set ships with the caret package
variances <- apply(GermanCredit, 2, var)
variances[which(variances <= 0.0025)]


Or, you can use the caret package to find near zero variance. An advantage here is that it defines near zero
variance not in the numerical calculation of variance, but rather as a function of rarity:


"nearZeroVar diagnoses predictors that have one unique value (i.e. are zero variance predictors) or
predictors that have both of the following characteristics: they have very few unique values relative to
the number of samples and the ratio of the frequency of the most common value to the frequency of the
second most common value is large..."


library(caret)
names(GermanCredit)[nearZeroVar(GermanCredit)]
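The two criteria in the quotation can be sketched in base R. This is not caret's implementation: nzv_sketch is a hypothetical helper, the toy data frame is made up, and the cutoffs 95/5 and 10 are assumed here to mirror caret's documented defaults (freqCut and uniqueCut).

```r
# Hypothetical helper mirroring the two criteria quoted above.
nzv_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)             # a single unique value: zero variance
  freq_ratio <- counts[[1]] / counts[[2]]          # most common vs. second most common
  pct_unique <- 100 * length(counts) / length(x)   # unique values relative to sample size
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

toy <- data.frame(
  constant = rep(1, 100),           # one unique value
  rare     = c(rep(0, 98), 1, 1),   # 98 zeros and 2 ones
  varied   = 1:100                  # all values distinct
)

flags <- vapply(toy, nzv_sketch, logical(1))
names(toy)[flags]
```

The rare column has a perfectly ordinary numerical variance, yet it is flagged because almost all of its values are identical, which is the point the quotation makes.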


Section 103.2: Removing features with high numbers of NA 
If a feature is largely lacking data, it is a good candidate for removal: 


library(VIM) 
data(sleep) 
colMeans(is.na(sleep))


   BodyWgt   BrainWgt       NonD      Dream      Sleep       Span       Gest
0.00000000 0.00000000 0.22580645 0.19354839 0.06451613 0.06451613 0.06451613
      Pred        Exp     Danger
0.00000000 0.00000000 0.00000000


In this case, we may want to remove NonD and Dream, which each have around 20% missing values (your cutoff 
may vary) 
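Dropping the affected columns can then be done by subsetting on the NA share. A minimal sketch on a made-up data frame; the 25% cutoff is an assumption chosen only for the example.

```r
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(NA, NA, 3, 4, 5),
                  c = c(1, NA, 3, 4, 5))

na_share <- colMeans(is.na(toy))   # fraction of missing values per column
cutoff   <- 0.25                   # assumed threshold; tune it for your data

# drop = FALSE keeps a data frame even if only one column survives.
reduced <- toy[, na_share <= cutoff, drop = FALSE]
names(reduced)
```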


Section 103.3: Removing closely correlated features 


Closely correlated features may add variance to your model, and removing one of a correlated pair might help 
reduce that. There are lots of ways to detect correlation. Here's one: 


library(purrr) # in order to use keep()


# select correlatable vars
toCorrelate <- mtcars %>% keep(is.numeric)


# calculate correlation matrix
correlationMatrix <- cor(toCorrelate)


# pick only one out of each highly correlated pair's mirror image
correlationMatrix[upper.tri(correlationMatrix)] <- 0


# and I don't remove the highly-correlated-with-itself group
diag(correlationMatrix) <- 0


# find features that are highly correlated with another feature at the +- 0.85 level
apply(correlationMatrix, 2, function(x) any(abs(x) >= 0.85))


  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
 TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE


I'll want to look at what MPG is correlated to so strongly, and decide what to keep and what to toss. Same for cyl
and disp. Alternatively, I might need to combine some strongly correlated features.
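To see which features each flagged column is correlated with, the matrix can be queried for the offending pairs directly. A base R sketch using the same 0.85 threshold; the names cm, hits and pairs are illustrative.

```r
cm <- cor(mtcars)

# Keep only the lower triangle so each pair is reported once.
cm[upper.tri(cm, diag = TRUE)] <- 0

# Row and column positions of the strong correlations.
hits <- which(abs(cm) >= 0.85, arr.ind = TRUE)

pairs <- data.frame(feature_1 = rownames(cm)[hits[, "row"]],
                    feature_2 = colnames(cm)[hits[, "col"]],
                    r         = round(cm[hits], 3))
pairs
```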
Chapter 104: Bibliography in RMD


Parameter in YAML header Detail 

toc table of contents 

number_sections numbering the sections automatically 
bibliography path to the bibliography file 

csl path to the style file 


Section 104.1: Specifying a bibliography and cite authors 


The most important part of your RMD file is the YAML header. For writing an academic paper, I suggest using PDF
output, numbered sections and a table of contents (toc).


---
title: "Writing an academic paper in R"
author: "Author"
date: "Date"
output:
  pdf_document:
    number_sections: yes
    toc: yes
bibliography: bibliography.bib
---


In this example, our file bibliography.bib looks like this:


@ARTICLE{Meyer2000,
  AUTHOR = "Bernd Meyer",
  TITLE = "A constraint-based framework for diagrammatic reasoning",
  JOURNAL = "Applied Artificial Intelligence",
  VOLUME = "14",
  ISSUE = "4",
  PAGES = "327--344",
  YEAR = 2000
}


To cite an author mentioned in your .bib file write @ and the bibkey, e.g. Meyer2000. 


# Introduction

`@Meyer2000` results in @Meyer2000.

`@Meyer2000 [p. 328]` results in @Meyer2000 [p. 328]

`[@Meyer2000]` results in [@Meyer2000]

`[-@Meyer2000]` results in [-@Meyer2000]

# Summary

# References


Rendering the RMD file via RStudio (Ctrl+Shift+K) or via console rmarkdown::render("<path-to-your-RMD-file>")
results in the following output:
Writing an academic paper in R

Author

Date

Contents
1 Introduction
2 Summary
References

1 Introduction

@Meyer2000 results in Meyer (2000).
@Meyer2000 [p. 328] results in Meyer (2000, 328)
[@Meyer2000] results in (Meyer 2000)
[-@Meyer2000] results in (2000)

2 Summary

References

Meyer, Bernd. 2000. “A Constraint-Based Framework for Diagrammatic Reasoning.” Applied Artificial
Intelligence 14 (4): 327-44.


Section 104.2: Inline references


If you have no *.bib file, you can use a references field in the document’s YAML metadata. This should include an 


array of YAML-encoded references, for example: 
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title: "Writing an academic paper in R" 
author: "Author" 

date: "Date" 

output: 

pdf document: 

number sections: yes 

toc: yes 

references: 

- id: Meyer2000 

title: A Constraint-Based Framework for Diagrammatic Reasoning 
author: 

- family: Meyer 

given: Bernd 

volume: 14 

issue: 4 

publisher: Applied Artificial Intelligence 
page: 327-344 

type: article-journal 

issued: 

year: 2000 


# Introduction

`@Meyer2000` results in @Meyer2000.

`@Meyer2000 [p. 328]` results in @Meyer2000 [p. 328]

`[@Meyer2000]` results in [@Meyer2000]

`[-@Meyer2000]` results in [-@Meyer2000]

# Summary

# References


Rendering this file results in the same output as in example "Specifying a bibliography". 


Section 104.3: Citation styles 


By default, pandoc will use a Chicago author-date format for citations and references. To use another style, you will
need to specify a CSL 1.0 style file in the csl metadata field. The following example uses a frequently used citation
style, the Elsevier Harvard style (download at https://github.com/citation-style-language/styles ). The style file has
to be stored in the same directory as the RMD file, or the absolute path to the file has to be given.


To use a style other than the default one, the following code is used:


---
title: "Writing an academic paper in R"
author: "Author"
date: "Date"
output:
  pdf_document:
    number_sections: yes
    toc: yes
bibliography: bibliography.bib
csl: elsevier-harvard.csl
---

# Introduction

`@Meyer2000` results in @Meyer2000.

`@Meyer2000 [p. 328]` results in @Meyer2000 [p. 328]

`[@Meyer2000]` results in [@Meyer2000]

`[-@Meyer2000]` results in [-@Meyer2000]

# Summary

# Reference


Writing an academic paper in R

Author
Date

Contents
1 Introduction 1
2 Summary 1
Reference 1


1 Introduction

@Meyer2000 results in Meyer (2000).
@Meyer2000 [p. 328] results in Meyer (2000, p. 328)
[@Meyer2000] results in (Meyer, 2000)
[-@Meyer2000] results in (2000)


2 Summary


Reference

Meyer, B., 2000. A constraint-based framework for diagrammatic reasoning. Applied Artificial Intelligence
14, 327-344.


Notice the differences to the output of the example "Specifying a bibliography and cite authors".
Chapter 105: Writing functions in R


Section 105.1: Anonymous functions 


An anonymous function is, as the name implies, not assigned a name. This can be useful when the function is part
of a larger operation but is too small to deserve a name of its own. One frequent use case for anonymous functions is
within the *apply family of base functions.


Calculate the Euclidean norm (the square root of the sum of squares) for each column in a data.frame:
df <- data.frame(first=5:9, second=(0:4)^2, third=-1:3)
apply(df, 2, function(x) { sqrt(sum(x^2)) })


    first    second     third 
15.968719 18.814888  3.872983 


Create a sequence of step-length one from the smallest to the largest value for each row in a matrix. 


x <- sample(1:6, 12, replace=TRUE) 
mat <- matrix(x, nrow=3) 


apply(mat, 1, function(x) { seq(min(x), max(x)) })


An anonymous function can also stand on its own:


(function() { 1 })() 
[1] 1 


is equivalent to


f <- function() { 1 }
f()
[1] 1
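
As a further illustration of the same idea, here is a minimal sketch using only base R (the variable names are made up for this example). Anonymous functions can be handed to other functions that expect a function argument, such as sapply(), mapply() and Filter():

```r
# Anonymous function with one argument, applied to each element of a vector.
squares <- sapply(1:5, function(x) x^2)

# Anonymous function with two arguments, applied pairwise to two vectors.
sums <- mapply(function(a, b) a + b, 1:3, 4:6)

# Anonymous function used as a predicate: keep only the even numbers.
evens <- Filter(function(x) x %% 2 == 0, 1:10)

print(squares)
print(sums)
print(evens)
```

In each call the function exists only for the duration of the call, which is the main reason to leave it unnamed.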


Section 105.2: RStudio code snippets 


This is just a small hack for those who often write their own functions.
Type "fun" in the RStudio IDE and hit TAB.


[Screenshot: the RStudio completion popup for "fun". The first entry is the snippet shown below, followed by
function {base}, functionBody {methods} and functionBody<- {methods}.]


${1:name} <- function(${2:variables}) {
  ${3:code}
}


The result will be a skeleton of a new function. 


name <- function(variables) { 
} 


One can easily define one's own snippet template, e.g. like the one below:
name <- function(df, x, y) {
  require(tidyverse)
  out <-
  return(out)
}


The option is Edit Snippets in the Global Options -> Code menu. 


Section 105.3: Named functions 


R is full of functions (it is, after all, a functional programming language), but sometimes the precise function you
need isn't provided in the base resources. You could conceivably install a package containing the function, but maybe
your requirements are so specific that no pre-made function fits the bill. Then you're left with the option of
making your own.


A function can be very simple, to the point of being pretty much pointless. It doesn't even need to take an
argument:


one <- function() { 1 } 
one() 


[1] 1 


two <- function() { 1 + 1 }
two()
[1] 2


What's between the curly braces { } is the function proper. As long as you can fit everything on a single line they 
aren't strictly needed, but can be useful to keep things organized. 


A function can be very simple, yet highly specific. This function takes as input a vector (vec in this example) and 
outputs the same vector with the vector's length (6 in this case) subtracted from each of the vector's elements. 


vec <- 4:9 

subtract.length <- function(x) { x - length(x) }
subtract.length(vec)

[1] -2 -1  0  1  2  3


Notice that length() is in itself a pre-supplied (i.e. Base) function. You can of course use a previously self-made 
function within another self-made function, as well as assign variables and perform other operations while 
spanning several lines: 


vec2 <- (4:7)/2 


msdf <- function(x, multiplier=4) {
    mult <- x * multiplier
    subl <- subtract.length(x)
    data.frame(mult, subl)
}

msdf(vec2, 5)
  mult subl
1 10.0 -2.0
2 12.5 -1.5
3 15.0 -1.0
4 17.5 -0.5


multiplier=4 makes sure that 4 is the default value of the argument multiplier: if no value is given when calling
the function, 4 is what will be used.


The above are all examples of named functions, so called simply because they have been given names (one, two,
subtract.length etc.).
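
The behavior of default arguments can be checked with a small sketch. The function scale_by() is a hypothetical helper invented for this illustration; it is not part of the examples above.

```r
# 'multiplier' has a default value, so the caller may omit it.
scale_by <- function(x, multiplier = 4) {
  x * multiplier
}

default_result <- scale_by(1:3)                   # multiplier falls back to 4
custom_result  <- scale_by(1:3, multiplier = 10)  # default overridden by name
positional     <- scale_by(1:3, 10)               # default overridden by position

print(default_result)
print(custom_result)
```

Passing the argument by name or by position gives the same result; naming it tends to be easier to read once a function has several optional arguments.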
Chapter 106: Color schemes for graphics


Section 106.1: viridis - print and colorblind friendly palettes 


Viridis (named after the Chromis viridis fish) is a color scheme developed for the Python library matplotlib
(the linked video presentation explains how the color scheme was developed and what its main advantages are).
It has been ported to R as the viridis package.


There are 4 variants of the color scheme: magma, inferno, plasma, and viridis (default). They are chosen with the
option parameter and are coded as A, B, C, and D, respectively. To get an impression of the 4 color schemes,
look at the maps:


[Figure: "US unemployment rate by county", drawn in four panels: option A aka 'magma', option B aka 'inferno',
option C aka 'plasma', option D aka 'viridis'.]


(image source)


The package can be installed from CRAN or github. 
The vignette for viridis package is just brilliant. 
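
If the viridis package is not installed, base R offers a rough substitute. The sketch below assumes R 3.6.0 or later, where grDevices::hcl.colors() provides palettes named "viridis", "plasma" and "inferno" that approximate the originals. This is a different technique from the viridis package itself, and the colors are close to but not identical with the package's.

```r
# Five colors from each approximated palette (base R, no extra packages).
pal_viridis <- hcl.colors(5, palette = "viridis")
pal_plasma  <- hcl.colors(5, palette = "plasma")
pal_inferno <- hcl.colors(5, palette = "inferno")

print(pal_viridis)

# Draw one swatch per color to inspect the palette visually.
barplot(rep(1, 5), col = pal_viridis, axes = FALSE,
        main = "viridis (hcl.colors approximation)")
```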


A nice feature of the viridis color scheme is its integration with ggplot2. The package defines two ggplot2-specific
functions: scale_color_viridis() and scale_fill_viridis(). See the example below:


library(viridis) 
library(ggplot2) 


gg1 <- ggplot(mtcars)+
    geom_point(aes(x = mpg, y = hp, color = disp), size = 3)+
    scale_color_viridis(option = "B")+
    theme_minimal()+
    theme(legend.position = c(.8, .8))


gg2 <- ggplot(mtcars)+
    geom_violin(aes(x = factor(cyl), y = hp, fill = factor(cyl)))+
    scale_fill_viridis(discrete = T)+
    theme_minimal()+
    theme(legend.position = 'none')


library(cowplot)
output <- plot_grid(gg1,gg2, labels = c('B','D'),label_size = 20)
print(output)


[Figure: the two plots side by side; panel B is the scatter plot with the disp color legend, panel D is the violin plot.]
Section 106.2: A handy function to glimpse a vector of colors


Quite often there is a need to glimpse the chosen color palette. One elegant solution is the following self-defined
function:


color_glimpse <- function(colors_string) {
    n <- length(colors_string)
    hist(1:n, breaks = 0:n, col = colors_string)
}


An example of use 


color_glimpse(blues9)


[Figure: "Histogram of 1:n", with one bar per color in blues9; y axis: Frequency.]


Section 106.3: colorspace - click&drag interface for colors 


The package colorspace provides a GUI for selecting a palette. Calling the choose_palette() function opens the
following window:


[Screenshot: the "Choose Color Palette" window. It offers a selector for the nature of your data (here: Diverging),
default color schemes, the palette description (Hue, Chroma, Luminance, Power), an option to correct all colors to
valid RGB color model values, the number of colors in the palette, an example plot with a plot type selector,
options to reverse and desaturate the colors, a color blindness preview (deutan, protan, tritan), and OK and
Cancel buttons.]


When the palette is chosen, just hit OK and do not forget to store the output in a variable, e.g. pal. 


pal <- choose_palette() 


The output is a function that takes n (number) as input and produces a color vector of length n according to the 
selected palette. 


pal(10)


 [1] "#023FA5" "#6371AF" "#959CC3" "#BEC1D4" "#DBDCE0" "#E0DBDC" "#D6BCC0" "#C6909A" "#AE5A6D"
[10] "#8E063B"


Section 106.4: Colorblind-friendly palettes 


Even though colorblind people can recognize a wide range of colors, it might be hard to differentiate between 
certain colors. 


RColorBrewer provides colorblind-friendly palettes:
library(RColorBrewer)
display.brewer.all(colorblindFriendly = T)


[Figure: swatches of the colorblind-friendly palettes, one row per palette: YlOrRd, YlOrBr, YlGnBu, YlGn, Reds, RdPu,
Purples, PuRd, PuBuGn, PuBu, OrRd, Oranges, Greys, Greens, GnBu, BuPu, BuGn, Blues, Set2, Paired, Dark2, RdYlBu,
RdBu, PuOr, PRGn, PiYG, BrBG.]


The Color Universal Design from the University of Tokyo proposes the following palettes: 


#palette using grey
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00",
"#CC79A7")


#palette using black
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00",
"#CC79A7")


Section 106.5: RColorBrewer 


The ColorBrewer project is a very popular tool for selecting harmoniously matching color palettes. RColorBrewer is a
port of the project for R and also provides colorblind-friendly palettes.


An example of use


library(RColorBrewer)
colors_vec <- brewer.pal(5, name = 'BrBG')
print(colors_vec)
[1] "#A6611A" "#DFC27D" "#F5F5F5" "#80CDC1" "#018571"


The ColorBrewer palettes are available in ggplot2 through scale_color_brewer and scale_fill_brewer.


library(ggplot2)
ggplot(mtcars)+
    geom_point(aes(x = mpg, y = hp, color = factor(cyl)), size = 3)+
    scale_color_brewer(palette = 'Greens')+
    theme_minimal()+
    theme(legend.position = c(.8,.8))


[Figure: scatter plot of hp against mpg, with points colored by factor(cyl) using the Greens palette.]
Section 106.6: basic R color functions


Function colors() lists all the color names that are recognized by R. There is a nice PDF where one can actually see 
those colors. 


colorRampPalette creates a function that interpolates a set of given colors to create new color palettes. The returned
function takes n (a number) as input and produces a color vector of length n interpolating the initial colors.


pal <- colorRampPalette(c('white','red'))
pal(5)
[1] "#FFFFFF" "#FFBFBF" "#FF7F7F" "#FF3F3F" "#FF0000"
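
The same mechanism works with more than two anchor colors. The following sketch (base R only, with variable names chosen for this example) builds a diverging blue-white-red ramp and decodes the result with col2rgb() to make the interpolation visible:

```r
# Interpolate through three anchor colors instead of two.
pal3 <- colorRampPalette(c("blue", "white", "red"))
cols <- pal3(5)
print(cols)

# col2rgb() returns a 3-row matrix (red, green, blue), one column per color.
rgb_matrix <- col2rgb(cols)
print(rgb_matrix)
```

With an odd n, the middle color of the ramp is exactly the middle anchor (white here), which is convenient for diverging scales centered on zero.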


Any specific color may be produced with the rgb() function:
rgb(0,1,0)


produces green.
Chapter 107: Hierarchical clustering with
hclust


The stats package provides the hclust function to perform hierarchical clustering. 


Section 107.1: Example 1 - Basic use of hclust, display of 
dendrogram, plot clusters 


The cluster library contains the ruspini data - a standard set of data for illustrating cluster analysis. 


library(cluster) ## to get the ruspini data 
plot(ruspini, asp=1, pch=20) ## take a look at the data 


[Figure: scatter plot of the ruspini data.]


hclust expects a distance matrix, not the original data. We compute the tree using the default parameters and
display it. The hang parameter lines up all of the leaves of the tree along the baseline.


ruspini_hc_defaults <- hclust(dist(ruspini) ) 

dend <- as.dendrogram(ruspini_hc_defaults) 

if(!require(dendextend)) install.packages("dendextend"); library(dendextend) 
dend <- color_branches(dend, k = 4) 

plot(dend) 


[Figure: dendrogram of the ruspini data with the branches of the four clusters drawn in different colors.]


Cut the tree to give four clusters and replot the data coloring the points by cluster. k is the desired number of 
clusters. 


rhc_def_4 = cutree(ruspini_hc_defaults, k=4)
plot(ruspini, pch=20, asp=1, col=rhc_def_4)


This clustering is a little odd. We can get a better clustering by scaling the data first.


scaled_ruspini_hc_defaults = hclust(dist(scale(ruspini)))
srhc_def_4 = cutree(scaled_ruspini_hc_defaults, 4)
plot(ruspini, pch=20, asp=1, col=srhc_def_4)


The default dissimilarity measure for comparing clusters (the linkage method) is "complete". You can specify a
different one with the method parameter.


ruspini_hc_single = hclust(dist(ruspini), method="single") 
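
To see how much the method parameter matters, the cluster sizes obtained with several linkage methods can be compared side by side. This is a minimal sketch; the choice of the three methods and the variable names are assumptions made for the illustration.

```r
library(cluster)   # provides the ruspini data

d <- dist(ruspini)
methods <- c("complete", "single", "average")

# For each linkage method, cut the tree into four clusters and count members.
sizes <- lapply(methods, function(m) {
  table(cutree(hclust(d, method = m), k = 4))
})
names(sizes) <- methods
print(sizes)
```

Very unbalanced cluster sizes for one method are a hint that the method is reacting to a few isolated points rather than to the main groups.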


Section 107.2: Example 2 - hclust and outliers 


With hierarchical clustering, outliers often show up as one-point clusters. 


Generate three Gaussian distributions to illustrate the effect of outliers. 


set.seed(656)
x = c(rnorm(150, 0, 1), rnorm(150,9,1), rnorm(150,4.5,1))
y = c(rnorm(150, 0, 1), rnorm(150,0,1), rnorm(150,5,1))
XYdf = data.frame(x,y)
plot(XYdf, pch=20)


[Figure: scatter plot of XYdf showing the three groups of points.]


Build the cluster structure and split it into three clusters.


XY_sing = hclust(dist(XYdf), method="single") 
XYs3 = cutree(XY_sing,k=3) 


table(XYs3) 
XYs3 

1 2 3 
448 1 1 


hclust found two outliers and put everything else into one big cluster. To get the "real" clusters, you may need to set 
k higher. 


XYs6 = cutree(XY_sing,k=6)
table(XYs6)
XYs6
  1   2   3   4   5   6
148 150   1 149   1   1
plot(XYdf, pch=20, col=XYs6)


[Figure: scatter plot of XYdf with points colored by the six clusters.]


This StackOverflow post has some guidance on how to pick the number of clusters, but be aware of this behavior in 
hierarchical clustering. 
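
One way to act on this behavior is to treat one-point clusters as outlier candidates and list them explicitly. The sketch below regenerates the data from this example; the names sizes, singleton_ids and outliers are made up for the illustration, and the threshold of exactly one member is an assumption.

```r
set.seed(656)
x <- c(rnorm(150, 0, 1), rnorm(150, 9, 1), rnorm(150, 4.5, 1))
y <- c(rnorm(150, 0, 1), rnorm(150, 0, 1), rnorm(150, 5, 1))
XYdf <- data.frame(x, y)

hc <- hclust(dist(XYdf), method = "single")
cl <- cutree(hc, k = 6)

# Clusters with exactly one member are treated as outlier candidates.
sizes <- table(cl)
singleton_ids <- as.integer(names(sizes)[as.vector(sizes) == 1])
outliers <- XYdf[cl %in% singleton_ids, ]
print(outliers)
```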
Chapter 108: Random Forest Algorithm


RandomForest is an ensemble method for classification or regression that reduces the chance of overfitting the
data. Details of the method can be found in the Wikipedia article on Random Forests. The main implementation for
R is in the randomForest package, but there are other implementations. See the CRAN view on Machine Learning.


Section 108.1: Basic examples - Classification and Regression 


###### Used for both Classification and Regression examples
library(randomForest)
library(car)            ## For the Soils data
data(Soils)


######################################################
##    RF Classification Example
set.seed(656)           ## for reproducibility
S_RF_Class = randomForest(Gp ~ ., data=Soils[,c(4,6:14)])
Gp_RF = predict(S_RF_Class, Soils[,6:14])
length(which(Gp_RF != Soils$Gp))    ## No Errors


## Naive Bayes for comparison
library(e1071)
S_NB = naiveBayes(Soils[,6:14], Soils[,4])
Gp_NB = predict(S_NB, Soils[,6:14], type="class")
length(which(Gp_NB != Soils$Gp))    ## 6 Errors


This example was tested on the training data, which overstates the accuracy, but it illustrates that RF can make very
good models.
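
A less optimistic estimate is available without a separate test set, because each tree is fitted on a bootstrap sample and can be evaluated on the observations it did not see (the out-of-bag, or OOB, observations). The sketch below uses the built-in iris data instead of Soils so that it only needs the randomForest package; the variable names are invented for the example.

```r
library(randomForest)

set.seed(656)   # for reproducibility
rf_iris <- randomForest(Species ~ ., data = iris)

# Errors when predicting the training data with the full forest.
train_errors <- sum(predict(rf_iris, iris) != iris$Species)

# Errors of the out-of-bag predictions stored in the fitted object.
oob_errors <- sum(rf_iris$predicted != iris$Species)

print(c(train = train_errors, oob = oob_errors))
```

The OOB error count is usually the larger of the two and is the more honest indication of how the model behaves on unseen data.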


######################################################
##    RF Regression Example
set.seed(656)           ## for reproducibility
S_RF_Reg = randomForest(pH ~ ., data=Soils[,6:14])
pH_RF = predict(S_RF_Reg, Soils[,6:14])


## Compare Predictions with Actual values for RF and Linear Model
S_LM = lm(pH ~ ., data=Soils[,6:14])
pH_LM = predict(S_LM, Soils[,6:14])
par(mfrow=c(1,2))
plot(Soils$pH, pH_RF, pch=20, ylab="Predicted", main="Random Forest")
abline(0, 1)
plot(Soils$pH, pH_LM, pch=20, ylab="Predicted", main="Linear Model")
abline(0, 1)


[Figure: two panels, "Random Forest" and "Linear Model", plotting the predicted pH against the actual pH.]
Chapter 109: RESTful R Services


OpenCPU uses standard R packaging to develop, ship and deploy web applications. 


Section 109.1: opencpu Apps 
The official website contains good examples of apps: https://www.opencpu.org/apps.html


The following code is used to serve an R session:


library(opencpu) 
opencpu$start(port = 5936) 


After this code is executed, you can use URLs to access the functions of the R session. The result could be XML, 
html, JSON or some other defined formats. 


For example, the previous R session can be accessed with a cURL call:


#curl uses the HTTP POST method for -X POST or -d "arg=value"
curl http://localhost:5936/ocpu/library/MASS/scripts/ch01.R -X POST
curl http://localhost:5936/ocpu/library/stats/R/rnorm -d "n=10&mean=5"


The call is asynchronous, meaning that the R session is not blocked while waiting for the call to finish (contrary to
shiny).
The call result is kept in a temporary session stored in /ocpu/tmp/


An example of how to retrieve the temporary session:


curl https://public.opencpu.org/ocpu/library/stats/R/rnorm -d n=5
/ocpu/tmp/x009f9e7630/R/.val
/ocpu/tmp/x009f9e7630/stdout
/ocpu/tmp/x009f9e7630/source
/ocpu/tmp/x009f9e7630/console
/ocpu/tmp/x009f9e7630/info


x009f9e7630 is the name of the session.


Pointing to /ocpu/tmp/x009f9e7630/R/.val will return the value resulting from rnorm(5),
/ocpu/tmp/x009f9e7630/console will return the content of the console for rnorm(5), etc.
Chapter 110: Machine learning


Section 110.1: Creating a Random Forest model 


One example of a machine learning algorithm is the Random Forest algorithm (Breiman, L. (2001). Random Forests.
Machine Learning 45(1), p. 5-32). The randomForest package implements this algorithm in R, based on Breiman's
original Fortran implementation.


Random Forest classifier objects can be created in R by preparing the class variable as a factor, which is already
the case in the iris data set. Therefore we can easily create a Random Forest by:


library(randomForest) 


rf <- randomForest(x = iris[, 1:4],
                   y = iris$Species,
                   ntree = 500,
                   do.trace = 100)


rf

# Call:
#  randomForest(x = iris[, 1:4], y = iris$Species, ntree = 500, do.trace = 100)
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 2
#
#         OOB estimate of  error rate: 4%
# Confusion matrix:
#            setosa versicolor virginica class.error
# setosa         50          0         0        0.00
# versicolor      0         47         3        0.06
# virginica       0          3        47        0.06


Parameter   Description
x           a data frame holding the describing variables of the classes
y           the classes of the individual observations. If this vector is a factor, a classification model is
            created; if not, a regression model is created.
ntree       the number of individual CART trees built
do.trace    every i-th step, the out-of-bag (OOB) errors overall and for each class are printed
Chapter 111: Using texreg to export models
in a paper-ready way


The texreg package helps to export a model (or several models) in a neat paper-ready way. The result may be 
exported as HTML or .doc (MS Office Word). 


Section 111.1: Printing linear regression results 


# models
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt+hp, data = mtcars)
fit3 <- lm(mpg ~ wt+hp+cyl, data = mtcars)


# export to html
texreg::htmlreg(list(fit1, fit2, fit3), file='models.html')


# export to doc
texreg::htmlreg(list(fit1, fit2, fit3), file='models.doc')


The result looks like a table in a paper.


             Model 1     Model 2     Model 3
(Intercept)  37.29***    37.23***    38.75***
             (1.88)      (1.60)      (1.79)
wt           -5.34***    -3.88***    -3.17***
             (0.56)      (0.63)      (0.74)
hp                       -0.03**     -0.02
                         (0.01)      (0.01)
cyl                                  -0.94
                                     (0.55)
R2           0.75        0.83        0.84
Adj. R2      0.74        0.81        0.83
Num. obs.    32          32          32
RMSE         3.05        2.59        2.51
***p < 0.001, **p < 0.01, *p < 0.05
Statistical models


There are several additional handy parameters in the texreg::htmlreg() function. Here is a use case for the most
helpful parameters.


# export to html
texreg::htmlreg(list(fit1, fit2, fit3), file='models.html',
                single.row = T,
                custom.model.names = LETTERS[1:3],
                leading.zero = F,
                digits = 3)


This results in a table like this:
             A                   B                   C
(Intercept)  37.285 (1.878)***   37.227 (1.599)***   38.752 (1.787)***
wt           -5.344 (0.559)***   -3.878 (0.633)***   -3.167 (0.741)***
hp                               -0.032 (0.009)**    -0.018 (0.012)
cyl                                                  -0.942 (0.551)
R2           0.753               0.827               0.843
Adj. R2      0.745               0.815               0.826
Num. obs.    32                  32                  32
RMSE         3.046               2.593               2.512
***p < 0.001, **p < 0.01, *p < 0.05
Statistical models
Chapter 112: Publishing


There are many ways of formatting R code, tables and graphs for publishing. 


Section 112.1: Formatting tables 


Here, "table" is meant broadly (covering data.frame, table, 
Printing to plain text 
Printing (as seen in the console) might suffice for a plain-text document to be viewed in monospaced font: 


Note: Before making the example data below, make sure you're in an empty folder you can write to. Run getwd() and 
read ?setwd if you need to change folders. 


..w = options()$width
options(width = 500) # reduce text wrapping
sink(file = "mytab.txt")
    summary(mtcars)
sink()
options(width = ..w)
rm(..w)
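
An alternative that avoids redirecting the console with sink() is to capture the printed lines and write them out directly. This is a sketch of that variation, not the method shown above; it writes to a temporary file (out_path is a name chosen for the example) so that nothing in the working directory is touched.

```r
out_path <- tempfile(fileext = ".txt")

# options() returns the previous values, which makes restoring them easy.
old_opts <- options(width = 500)   # reduce text wrapping
writeLines(capture.output(summary(mtcars)), out_path)
options(old_opts)

# Show the first lines of the file that was written.
head(readLines(out_path), 3)
```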


Printing delimited tables 


Writing to CSV (or another common format) and then opening in a spreadsheet editor to apply finishing touches is 
another option: 


Note: Before making the example data below, make sure you're in an empty folder you can write to. Run getwd() and 
read ?setwd if you need to change folders. 


write.csv(mtcars, file="mytab.csv") 
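
Before handing the file to someone else, it can be worth reading it back to confirm that nothing was lost. A minimal sketch, assuming a temporary file instead of mytab.csv (csv_path and mtcars_back are names made up here):

```r
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, file = csv_path)

# write.csv stores the row names in the first column, so read them back from there.
mtcars_back <- read.csv(csv_path, row.names = 1)

dim(mtcars_back)
head(rownames(mtcars_back), 3)
```

Without row.names = 1, the car names would come back as an ordinary first column with an automatically generated name.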


Further resources 


• knitr::kable
• stargazer
• tables::tabular
• texreg
• xtable


Section 112.2: Formatting entire documents 
Sweave from the utils package allows for formatting code, prose, graphs and tables together in a LaTeX document. 


Further Resources 


• knitr and RMarkdown
Chapter 113: Implement State Machine
Pattern using S4 Class


Finite state machine concepts are usually implemented in object-oriented programming (OOP) languages, for
example Java, based on the State pattern defined by the GoF (the book "Design Patterns").


R provides several mechanisms to simulate the OO paradigm; let's apply the S4 object system to implement this
pattern.


Section 113.1: Parsing Lines using State Machine 


Let's apply the State Machine pattern to parse lines with a specific pattern, using the S4 class feature of R.

PROBLEM STATEMENT


We need to parse a file where each line provides information about a person, using a delimiter (";"), but some of
the information is optional, and instead of providing an empty field, it is missing. Each line can have the
following information: Name; [Address;] Phone. The address information is optional: sometimes we have it
and sometimes we don't, for example:


GREGORY BROWN; 25 NE 25TH; +1-786-987-6543 
DAVID SMITH; 786-123-4567 
ALAN PEREZ; 25 SE 50TH; +1-786-987-5553


The second line does not provide address information. Therefore the number of delimiters may differ: this line has
one delimiter, while the other lines have two. Because the number of delimiters may vary, one way to attack this
problem is to recognize the presence or absence of a given field based on its pattern. In that case we can use a
regular expression to identify such patterns. For example:


• Name: "^([A-Z]'?\\s+)* *[A-Z]+(\\s+[A-Z]{1,2}\\.?,? +)*[A-Z]+((-|\\s+)[A-Z]+)*$". For example:
  RAFAEL REAL, DAVID R. SMITH, ERNESTO PEREZ GONZALEZ, O' CONNOR BROWN, LUIS PEREZ-MENA, etc.
• Address: "^\\s[0-9]{1,4}(\\s+[A-Z]{1,2}[0-9]{1,2}[A-Z]{1,2}|[A-Z\\s0-9]+)$". For example: 11020
  LE JEUNE ROAD, 87 SW 27TH. For the sake of simplicity we don't include the zip code, city and state here, but they
  can be included in this field or added as additional fields.
• Phone: "^\\s*(\\+1(-|\\s+))*[0-9]{3}(-|\\s+)[0-9]{3}(-|\\s+)[0-9]{4}$". For example:
  305-123-4567, 305 123 4567, +1-786-123-4567


Notes:


• I am considering the most common pattern of US addresses and phone numbers; it can easily be extended to cover
  more general situations.
• In R the sign "\" has a special meaning in character strings, therefore we need to escape it.
• In order to simplify the process of defining regular expressions, a good recommendation is to use the
  following web page: regex101.com, so you can play with it, with a given example, until you get the expected
  result for all possible combinations.
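
The patterns can also be tried directly in R before building the state machine around them. The sketch below is an illustration under two assumptions: the regular expressions are the ones listed above as read from this text, and matching is done with perl = TRUE so that \\s inside a bracket expression is interpreted as whitespace. The helper classify_field() is hypothetical and only serves this check.

```r
name_pattern    <- "^([A-Z]'?\\s+)* *[A-Z]+(\\s+[A-Z]{1,2}\\.?,? +)*[A-Z]+((-|\\s+)[A-Z]+)*$"
address_pattern <- "^\\s[0-9]{1,4}(\\s+[A-Z]{1,2}[0-9]{1,2}[A-Z]{1,2}|[A-Z\\s0-9]+)$"
phone_pattern   <- "^\\s*(\\+1(-|\\s+))*[0-9]{3}(-|\\s+)[0-9]{3}(-|\\s+)[0-9]{4}$"

# Hypothetical helper: report which of the three patterns a field matches.
classify_field <- function(field) {
  if (grepl(name_pattern, field, perl = TRUE)) {
    "name"
  } else if (grepl(address_pattern, field, perl = TRUE)) {
    "address"
  } else if (grepl(phone_pattern, field, perl = TRUE)) {
    "phone"
  } else {
    "unknown"
  }
}

# Split two of the sample lines on the delimiter and classify each field.
fields_full  <- strsplit("GREGORY BROWN; 25 NE 25TH; +1-786-987-6543", ";")[[1]]
fields_short <- strsplit("DAVID SMITH; 786-123-4567", ";")[[1]]

kinds_full  <- vapply(fields_full, classify_field, character(1), USE.NAMES = FALSE)
kinds_short <- vapply(fields_short, classify_field, character(1), USE.NAMES = FALSE)
print(kinds_full)
print(kinds_short)
```

Note that the fields keep their leading space after the split, which is what the address and phone patterns expect.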


The idea is to identify each field of a line based on the previously defined patterns. The State pattern defines the
following entities (classes) that collaborate to control the specific behavior (the State pattern is a behavioral
pattern):


[Diagram: UML class diagram of the State pattern with its Context, State and concrete state classes and the
handle() method.]


Let's describe each element considering the context of our problem: 


• Context: Stores the context information of the parsing process, i.e. the current state, and handles the entire
  State Machine process. For each state, an action is executed (handle()), but the context delegates it, based
  on the state, to the action method defined for that particular state (handle() from the State class). It defines
  the interface of interest to clients. Our Context class can be defined like this:
    o Attributes: state
    o Methods: handle(), ...
• State: The abstract class that represents any state of the State Machine. It defines an interface for
  encapsulating the behavior associated with a particular state of the context. It can be defined like this:
    o Attributes: name, pattern
    o Methods: doAction(), isState (using the pattern attribute, verifies whether the input argument belongs to
      this state pattern or not), ...
• Concrete States (state sub-classes): Each subclass of the class State implements a behavior associated
  with a state of the Context. Our sub-classes are: InitState, NameState, AddressState, PhoneState. Such
  classes just implement the generic methods using the specific logic for such states. No additional attributes
  are required.


Note: It is a matter of preference how to name the method that carries out the action: handle(), doAction() or
goNext(). The method name doAction() could be the same for both classes (State or Context); we preferred to name it
handle() in the Context class to avoid confusion when defining two generic methods with the same input
arguments but a different class.


PERSON CLASS 


Using the S4 syntax we can define a Person class like this: 


setClass(Class = "Person",
    slots = c(name = "character", address = "character", phone = "character")
)


It is good practice to initialize the class attributes. This can be done with a method for the generic function
"initialize"; note that the setClass documentation marks the representation argument as deprecated in favor of slots.


setMethod("initialize", "Person",
    definition = function(.Object, name = NA_character_,
                          address = NA_character_, phone = NA_character_) {
        .Object@name <- name
        .Object@address <- address
        .Object@phone <- phone
        .Object
    }
)


Because the initialize method is already a standard generic method of package methods, we need to respect the
original argument definition. We can verify it by typing at the R prompt:
> initialize
It returns the entire function definition; at the top you can see how the function is defined:


function (.Object, ...) {...}


Therefore, when we use setMethod we need to follow exactly the same syntax (.Object).


Another existing generic method is show. It is equivalent to the toString() method from Java, and it is a good idea
to have a specific implementation for the classes of the domain:


setMethod("show", signature = "Person",
    definition = function(object) {
        info <- sprintf("%s@[name='%s', address='%s', phone='%s']",
                        class(object), object@name, object@address, object@phone)
        cat(info)
        invisible(NULL)
    }
)


Note: We use the same convention as in the default toString() Java implementation. 
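
To see the three pieces working together, the class can be instantiated with new(). The block below repeats the definitions from above so that it runs on its own; the object p and the sample values are made up for the illustration.

```r
library(methods)

setClass(Class = "Person",
         slots = c(name = "character", address = "character", phone = "character"))

setMethod("initialize", "Person",
  definition = function(.Object, name = NA_character_,
                        address = NA_character_, phone = NA_character_) {
    .Object@name <- name
    .Object@address <- address
    .Object@phone <- phone
    .Object
  })

setMethod("show", signature = "Person",
  definition = function(object) {
    info <- sprintf("%s@[name='%s', address='%s', phone='%s']",
                    class(object), object@name, object@address, object@phone)
    cat(info)
    invisible(NULL)
  })

# The address is omitted, so initialize() falls back to NA_character_.
p <- new("Person", name = "DAVID SMITH", phone = "786-123-4567")
p                     # auto-printing an S4 object calls the show method
is.na(p@address)
```

Omitted slots end up as NA rather than as zero-length vectors, which is what makes the later conversion to a dataset straightforward.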


Let's say we want to save the parsed information (a list of Person objects) into a dataset. Then we should first be
able to convert a list of objects into something that R can transform (for example, coerce the object to a list). We
can define the following additional method (for more detail about this see the post):


setGeneric(name = "as.list", signature = c('x'),
    def = function(x) standardGeneric("as.list"))


# Suggestion taken from here:
# http://stackoverflow.com/questions/30386009/how-to-extend-as-list-in-a-canonical-way-to-s4-objects
setMethod("as.list", signature = "Person",
    definition = function(x) {
        mapply(function(y) {
            #apply as.list if the slot is again an user-defined object
            #therefore, as.list gets applied recursively
            if (inherits(slot(x,y),"Person")) {
                as.list(slot(x,y))
            } else {
                #otherwise just return the slot
                slot(x,y)
            }
        },
        slotNames(class(x)),
        SIMPLIFY=FALSE)
    }
)


R does not provide syntactic sugar for OO because the language was initially conceived to provide valuable
functions for statisticians. Therefore each user method requires two parts: 1) the definition part (via setGeneric)
and 2) the implementation part (via setMethod), as in the example above.
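
For the specific goal of building a dataset from a list of Person objects, a simpler route that does not redefine as.list is to read the slots directly. This is an alternative sketch, not the approach described above; person_to_row() and the sample objects are hypothetical, and the class is defined again without the custom initialize method so that the block runs on its own.

```r
library(methods)

setClass("Person",
         slots = c(name = "character", address = "character", phone = "character"))

persons <- list(
  new("Person", name = "GREGORY BROWN", address = "25 NE 25TH",
      phone = "+1-786-987-6543"),
  new("Person", name = "DAVID SMITH", address = NA_character_,
      phone = "786-123-4567")
)

# Hypothetical helper: one Person becomes a one-row data frame.
person_to_row <- function(p) {
  data.frame(name = p@name, address = p@address, phone = p@phone,
             stringsAsFactors = FALSE)
}

people_df <- do.call(rbind, lapply(persons, person_to_row))
print(people_df)
```

The missing address is passed explicitly as NA_character_ here, because without an initialize method an unset slot would be a zero-length vector and could not fill a data frame cell.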


STATE CLASS 


Following S4 syntax, let's define the abstract State class.
setClass(Class = "State", slots = c(name = "character", pattern = "character"))


setMethod("initialize", "State",
    definition = function(.Object, name = NA_character_, pattern = NA_character_) {
        .Object@name <- name
        .Object@pattern <- pattern
        .Object
    }
)
setMethod("show", signature = "State",
    definition = function(object) {
        info <- sprintf("%s@[name='%s', pattern='%s']", class(object),
                        object@name, object@pattern)
        cat(info)
        invisible(NULL)
    }
)
setGeneric(name = "isState", signature = c('obj', 'input'),
    def = function(obj, input) standardGeneric("isState"))
setGeneric(name = "doAction", signature = c('obj', 'input', 'context'),
    def = function(obj, input, context) standardGeneric("doAction"))


Every sub-class of State will have an associated name and pattern, but also a way to identify whether a given
input belongs to this state or not (the isState() method), and it will implement the corresponding actions for this
state (the doAction() method).


In order to understand the process, let's define the transition matrix for each state based on the input received: 


Input/Current State   Init   Name      Address   Phone
Name                  Name
Address                      Address
Phone                        Phone     Phone
End                                              End


Note: The cell [row, col]=[i, j] represents the destination state for the current state j, when it receives the input
i.
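
The table can be written down as a plain character matrix to make the transitions explicit. This is only an illustrative sketch and not part of the S4 implementation developed in this chapter; the names transitions and next_state are made up, and empty cells are encoded as NA to mean "invalid input for this state".

```r
states <- c("Init", "Name", "Address", "Phone")
inputs <- c("Name", "Address", "Phone", "End")

# Rows are inputs, columns are current states; NA marks an invalid combination.
transitions <- matrix(NA_character_, nrow = length(inputs), ncol = length(states),
                      dimnames = list(input = inputs, state = states))
transitions["Name",    "Init"]    <- "Name"
transitions["Address", "Name"]    <- "Address"
transitions["Phone",   "Name"]    <- "Phone"
transitions["Phone",   "Address"] <- "Phone"
transitions["End",     "Phone"]   <- "End"

# Look up the destination state, failing loudly on an invalid input.
next_state <- function(state, input) {
  target <- transitions[input, state]
  if (is.na(target)) {
    stop(sprintf("Invalid input '%s' in state '%s'", input, state))
  }
  target
}

# Walk through a line without an address: Init -> Name -> Phone -> End.
path <- Reduce(next_state, c("Name", "Phone", "End"), "Init", accumulate = TRUE)
print(path)
```

A lookup table like this is handy for checking the design, while the S4 classes carry the actual parsing logic for each state.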


It means that under the state Name it can receive two inputs: an address or a phone number. Another way to 
represents the transaction table is using the following UML State Machine diagram: 
[Figure: UML state machine diagram showing the states Name and Address, with transitions labelled "is Address", "is Phone" and "is error".]


is error: when the input argument has an invalid pattern


Let's implement each particular state as a sub-class of the class State.


STATE SUB-CLASSES 


Init State: 


The initial state will be implemented via the following class: 


setClass("InitState", contains = "State")

setMethod("initialize", "InitState",
    definition = function(.Object, name = "init", pattern = NA_character_) {
        .Object@name <- name
        .Object@pattern <- pattern
        .Object
    }
)

setMethod("show", signature = "InitState",
    definition = function(object) {
        callNextMethod()
    }
)


In R, you indicate that a class is a sub-class of another class with the contains argument, giving the name of
the parent class.

Because the sub-classes just implement the generic methods, without adding additional attributes, the show
method simply calls the equivalent method of the parent class (via callNextMethod()).


The initial state has no associated pattern; it just represents the beginning of the process, so we initialize
the class with an NA value.


Now let's implement the generic methods from the State class:


setMethod(f = "isState", signature = "InitState",
    definition = function(obj, input) {
        nameState <- new("NameState")
        result <- isState(nameState, input)
        return(result)
    }
)


For this particular state (which has no pattern), the idea is simply to start the parsing process expecting the first field to
be a name; otherwise it is an error.


setMethod(f = "doAction", signature = "InitState",
    definition = function(obj, input, context) {
        nameState <- new("NameState")
        if (isState(nameState, input)) {
            person <- context@person
            person@name <- trimws(input)
            context@person <- person
            context@state <- nameState
        } else {
            msg <- sprintf("The input argument: '%s' cannot be identified", input)
            stop(msg)
        }
        return(context)
    }
)


The doAction method performs the transition and updates the context with the extracted information. Here we
access the context information directly via the @ operator. We could instead define get/set methods to encapsulate this
access (as OO best practice recommends), but that would add four more methods per get/set pair
without adding value for the purpose of this example.
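
For comparison, a get/set pair for a single slot could look like the sketch below. The class PersonDemo and the generics getName and setName are hypothetical stand-ins; the chapter's own Person class does not define them.

```r
library(methods)

# Hypothetical stand-in for the chapter's Person class (one slot only)
setClass("PersonDemo", slots = c(name = "character"))

# Getter: a generic plus a method
setGeneric("getName", function(obj) standardGeneric("getName"))
setMethod("getName", "PersonDemo", function(obj) obj@name)

# Setter: another generic plus a method; it returns the modified copy
setGeneric("setName", function(obj, value) standardGeneric("setName"))
setMethod("setName", "PersonDemo", function(obj, value) {
  obj@name <- trimws(value)
  obj
})

p <- new("PersonDemo", name = "UNKNOWN")
p <- setName(p, " GREGORY BROWN ")
getName(p)
```

This shows where the count of four comes from: each of the getter and the setter needs both a setGeneric and a setMethod call.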


It is good practice for every doAction implementation to include a safeguard for the case where the input argument is not
properly identified.


Name State 


Here is the definition of this class:


setClass("NameState", contains = "State")

setMethod("initialize", "NameState",
    definition = function(.Object, name = "name",
        pattern = "^([A-Z]'?\\s+)* *[A-Z]+(\\s+[A-Z]{1,2}\\.?,? +)*[A-Z]+((-|\\s+)[A-Z]+)*$") {
        .Object@pattern <- pattern
        .Object@name <- name
        .Object
    }
)

setMethod("show", signature = "NameState",
    definition = function(object) {
        callNextMethod()
    }
)


We use the function grepl to verify that the input matches a given pattern.


setMethod(f = "isState", signature = "NameState",
    definition = function(obj, input) {
        result <- grepl(obj@pattern, input, perl = TRUE)
        return(result)
    }
)
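
To see grepl in isolation, here is a small sketch with a deliberately simple, hypothetical pattern (it is not one of the patterns used by the state classes):

```r
# Hypothetical pattern: optional leading spaces, then ddd-dddd
pattern <- "^\\s*[0-9]{3}-[0-9]{4}$"
inputs <- c(" 555-1234", "JOHN SMITH", "555-12345")

# grepl returns one logical value per input element
matches <- grepl(pattern, inputs, perl = TRUE)
matches
```

Because grepl is vectorised, the same call works for a single field or for a whole vector of fields.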


Now we define the action to carry out for a given state: 


setMethod(f = "doAction", signature = "NameState",
    definition = function(obj, input, context) {
        addressState <- new("AddressState")
        phoneState <- new("PhoneState")
        person <- context@person
        if (isState(addressState, input)) {
            person@address <- trimws(input)
            context@person <- person
            context@state <- addressState
        } else if (isState(phoneState, input)) {
            person@phone <- trimws(input)
            context@person <- person
            context@state <- phoneState
        } else {
            msg <- sprintf("The input argument: '%s' cannot be identified", input)
            stop(msg)
        }
        return(context)
    }
)


Here we consider two possible transitions: one to the Address state and the other to the Phone state. In all cases we
update the context information:


• The person information: address or phone, with the input argument.
• The state of the process.


The way to identify the state is to invoke the isState() method of a particular state. We create default instances of the
specific states (addressState, phoneState) and then ask each one for validation.


The logic for the other sub-classes (one per state) implementation is very similar. 


Address State 


setClass("AddressState", contains = "State")

setMethod("initialize", "AddressState",
    definition = function(.Object, name = "address",
        pattern = "\\s[0-9]{1,4}(\\s+[A-Z]{1,2}[0-9]{1,2}[A-Z]{1,2}|[A-Z\\s0-9]+)$") {
        .Object@pattern <- pattern
        .Object@name <- name
        .Object
    }
)


setMethod("show", signature = "AddressState", 
definition = function(object) { 
callNextMethod( ) 
} 
) 


setMethod(f = "isState", signature = "AddressState",
    definition = function(obj, input) {
        result <- grepl(obj@pattern, input, perl = TRUE)
        return(result)
    }
)


setMethod(f = "doAction", "AddressState",
    definition = function(obj, input, context) {
        phoneState <- new("PhoneState")
        if (isState(phoneState, input)) {
            person <- context@person
            person@phone <- trimws(input)
            context@person <- person
            context@state <- phoneState
        } else {
            msg <- sprintf("The input argument: '%s' cannot be identified", input)
            stop(msg)
        }
        return(context)
    }
)


Phone State 


setClass("PhoneState", contains = "State")

setMethod("initialize", "PhoneState",
    definition = function(.Object, name = "phone",
        pattern = "^\\s*(\\+1(-|\\s+))*[0-9]{3}(-|\\s+)[0-9]{3}(-|\\s+)[0-9]{4}$") {
        .Object@pattern <- pattern
        .Object@name <- name
        .Object
    }
)

setMethod("show", signature = "PhoneState",
    definition = function(object) {
        callNextMethod()
    }
)


setMethod(f = "isState", signature = "PhoneState",
    definition = function(obj, input) {
        result <- grepl(obj@pattern, input, perl = TRUE)
        return(result)
    }
)


Here is where we add the person information into the list of persons of the context. 


setMethod(f = "doAction", "PhoneState",
    definition = function(obj, input, context) {
        context <- addPerson(context, context@person)
        context@state <- new("InitState")
        return(context)
    }
)


CONTEXT CLASS 


Now let's explain the Context class implementation. We can define it with the following attributes:


setClass(Class = "Context",
    slots = c(state = "State", persons = "list", person = "Person")
)

Where:

• state: The current state of the process.
• person: The current person; it holds the information we have already parsed from the current line.
• persons: The list of persons parsed so far.


Note: Optionally, we can add a name to identify the context in case we are working with more than one
parser type.


setMethod(f = "initialize", signature = "Context",
    definition = function(.Object) {
        .Object@state <- new("InitState")
        .Object@persons <- list()
        .Object@person <- new("Person")
        return(.Object)
    }
)
setMethod("show", signature = "Context",
    definition = function(object) {
        cat("An object of class ", class(object), "\n", sep = "")
        info <- sprintf("[state='%s', persons='%s', person='%s']", object@state,
            toString(object@persons), object@person)
        cat(info)
        invisible(NULL)
    }
)

setGeneric(name = "handle", signature = c('obj', 'input', 'context'),
    def = function(obj, input, context) standardGeneric("handle"))

setGeneric(name = "addPerson", signature = c('obj', 'person'),
    def = function(obj, person) standardGeneric("addPerson"))

setGeneric(name = "parseLine", signature = c('obj', 's'),
    def = function(obj, s) standardGeneric("parseLine"))

setGeneric(name = "parseLines", signature = c('obj', 's'),
    def = function(obj, s) standardGeneric("parseLines"))

setGeneric(name = "as.df", signature = c('obj'),
    def = function(obj) standardGeneric("as.df"))


With these generic methods, we control the entire behavior of the parsing process:

• handle(): Invokes the doAction() method of the current state.
• addPerson(): Once we reach the end state, adds a person to the list of persons we have parsed.
• parseLine(): Parses a single line.
• parseLines(): Parses multiple lines (a vector of lines).
• as.df(): Extracts the information from the persons list into a data frame object.


Let's go on now with the corresponding implementations: 


The handle() method delegates to the doAction() method of the context's current state:


setMethod(f = "handle", signature = "Context",
    definition = function(obj, input) {
        obj <- doAction(obj@state, input, obj)
        return(obj)
    }
)

setMethod(f = "addPerson", signature = "Context",
    definition = function(obj, person) {
        obj@persons <- c(obj@persons, person)
        return(obj)
    }
)


First, we split the original line into a vector using the delimiter, via the R function strsplit(), and
then iterate over the elements, passing each one as the input value for the current state. The handle() method returns the context
with the updated information (the state, person and persons attributes).


setMethod(f = "parseLine", signature = "Context",
    definition = function(obj, s) {
        elements <- strsplit(s, ";")[[1]]
        # Adding an empty field for considering the end state.
        elements <- c(elements, "")
        n <- length(elements)
        input <- NULL
        for (i in (1:n)) {
            input <- elements[i]
            obj <- handle(obj, input)
        }
        return(obj@person)
    }
)


Because R makes a copy of the input argument, we need to return the context (obj):


setMethod(f = "parseLines", signature = "Context",
    definition = function(obj, s) {
        n <- length(s)
        listOfPersons <- list()
        for (i in (1:n)) {
            ipersons <- parseLine(obj, s[i])
            listOfPersons[[i]] <- ipersons
        }
        obj@persons <- listOfPersons
        return(obj)
    }
)
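
The copy semantics that make the return value necessary can be seen in isolation with a tiny sketch (the Counter class and the increment function are hypothetical and only serve this illustration):

```r
library(methods)

# Hypothetical class with a single numeric slot
setClass("Counter", slots = c(n = "numeric"))

# The function modifies its own local copy and must return it
increment <- function(obj) {
  obj@n <- obj@n + 1
  obj
}

counter <- new("Counter", n = 0)

ignored <- increment(counter)   # result not assigned back to counter
before <- counter@n             # still 0: the original object is unchanged

counter <- increment(counter)   # assign the returned copy back
after <- counter@n              # now 1
```

This is why every method in this example that changes the context ends with return(context) or return(obj), and why callers reassign the result.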
The attribute persons is a list of instances of the S4 Person class. Such a list cannot be coerced to any standard type
because R does not know how to treat an instance of a user-defined class. The solution is to convert each Person into a
list, using the as.list method previously defined. We apply this function to each element of the list
persons via the lapply() function. Then a second call to lapply() applies the data.frame
function to convert each element of persons.list into a data frame. Finally, the rbind() function is called
to add each converted element as a new row of the resulting data frame (for more detail, see the post referenced in the code below).


# Suggestion taken from this post:
# http://stackoverflow.com/questions/4227223/r-list-to-data-frame
setMethod(f = "as.df", signature = "Context",
    definition = function(obj) {
        persons <- obj@persons
        persons.list <- lapply(persons, as.list)
        persons.ds <- do.call(rbind, lapply(persons.list, data.frame, stringsAsFactors = FALSE))
        return(persons.ds)
    }
)
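
The list-to-data-frame step can be tried on its own with plain lists, without any S4 classes. This is a sketch with hand-written input that mimics what as.list is described as returning for each person:

```r
# Hand-written stand-in for lapply(persons, as.list)
persons.list <- list(
  list(name = "GREGORY BROWN", address = "25 NE 25TH", phone = "+1-786-987-6543"),
  list(name = "DAVID SMITH", address = NA_character_, phone = "786-123-4567")
)

# Each inner list becomes a one-row data frame ...
rows <- lapply(persons.list, data.frame, stringsAsFactors = FALSE)

# ... and rbind stacks the rows into a single data frame
persons.df <- do.call(rbind, rows)
persons.df
```

do.call is needed because rbind expects the rows as separate arguments, not as one list.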


PUTTING IT ALL TOGETHER


Finally, let's test the entire solution. Define the lines to parse, where the address information is
missing from the second line.


s <- c(
    "GREGORY BROWN; 25 NE 25TH; +1-786-987-6543",
    "DAVID SMITH; 786-123-4567",
    "ALAN PEREZ; 25 SE 50TH; +1-786-987-5553")


Now we initialize the context, and parse the lines: 


context <- new("Context") 
context <- parseLines(context, s) 


Finally, obtain the corresponding dataset and print it:


df <- as.df(context)


> df
           name    address           phone
1 GREGORY BROWN 25 NE 25TH +1-786-987-6543
2   DAVID SMITH       <NA>    786-123-4567
3    ALAN PEREZ 25 SE 50TH +1-786-987-5553

Let's now test the show methods:


> show(context@persons[[1]]) 
Person@[name='GREGORY BROWN', address='25 NE 25TH', phone='+1-786-987-6543' ] 


And for some sub-state: 


> show(new("PhoneState"))
PhoneState@[name='phone', pattern='^\s*(\+1(-|\s+))*[0-9]{3}(-|\s+)[0-9]{3}(-|\s+)[0-9]{4}$']


Finally, test the as.list() method:

> as.list(context@persons[[1]])
$name
[1] "GREGORY BROWN"

$address
[1] "25 NE 25TH"

$phone
[1] "+1-786-987-6543"


CONCLUSION 


This example shows how to implement the State pattern using one of the mechanisms R offers for
the OO paradigm. Nevertheless, the R OO solution is not user-friendly and differs considerably from other OOP
languages. You need to switch your mindset because the syntax is completely different; it is more reminiscent of the
functional programming paradigm. For example, instead of object.setID("A1") as in Java/C#, in R you have to
invoke the method this way: setID(object, "A1"). Therefore you always have to pass the object as an input
argument to provide the context of the function. In the same way, there is no special this attribute, nor
a "." notation for accessing methods or attributes of a given class. It is also more error-prone, because
classes and methods are referred to by character strings ("Person", "isState", etc.).


That said, the S4 solution requires many more lines of code than traditional languages such as Java or C# for
simple tasks. Still, the State pattern is a good, generic solution for this kind of problem. It simplifies the
process by delegating the logic to a particular state. Instead of one big if-else block controlling all
situations, we have smaller if-else blocks inside each State sub-class that implement the
action to carry out in each state.


Chapter 114: Reshape using tidyr


tidyr has two tools for reshaping data: gather (wide to long) and spread (long to wide). 


See Reshaping data for other options. 


Section 114.1: Reshape from long to wide format with spread() 


library(tidyr)

## example data
set.seed(123)
df <- data.frame(
    name = rep(c("firstName", "secondName"), each = 4),
    numbers = rep(1:4, 2),
    value = rnorm(8)
)
df
#         name numbers       value
# 1  firstName       1 -0.56047565
# 2  firstName       2 -0.23017749
# 3  firstName       3  1.55870831
# 4  firstName       4  0.07050839
# 5 secondName       1  0.12928774
# 6 secondName       2  1.71506499
# 7 secondName       3  0.46091621
# 8 secondName       4 -1.26506123


We can "spread" the 'numbers' column, into separate columns: 


spread(data = df,
       key = numbers,
       value = value)
#         name          1          2         3           4
# 1  firstName -0.5604756 -0.2301775 1.5587083  0.07050839
# 2 secondName  0.1292877  1.7150650 0.4609162 -1.26506123

Or spread the 'name' column into separate columns:


spread(data = df,
       key = name,
       value = value)
#   numbers   firstName secondName
# 1       1 -0.56047565  0.1292877
# 2       2 -0.23017749  1.7150650
# 3       3  1.55870831  0.4609162
# 4       4  0.07050839 -1.2650612


Section 114.2: Reshape from wide to long format with gather() 


library(tidyr)

## example data
df <- read.table(text = "  numbers  firstName secondName
1       1  1.5862639  0.4087477
2       2  0.1499581  0.9963923
3       3  0.4117353  0.3740009
4       4 -0.4926862  0.4437916", header = TRUE)
df
#   numbers  firstName secondName
# 1       1  1.5862639  0.4087477
# 2       2  0.1499581  0.9963923
# 3       3  0.4117353  0.3740009
# 4       4 -0.4926862  0.4437916


We can gather the firstName and secondName columns into key-value pairs, keeping 'numbers' as the identifier column:


gather(data = df,
       key = name,
       value = myValue,
       firstName, secondName)
#   numbers       name    myValue
# 1       1  firstName  1.5862639
# 2       2  firstName  0.1499581
# 3       3  firstName  0.4117353
# 4       4  firstName -0.4926862
# 5       1 secondName  0.4087477
# 6       2 secondName  0.9963923
# 7       3 secondName  0.3740009
# 8       4 secondName  0.4437916
Chapter 115: Modifying strings by
substitution


sub and gsub are used to edit strings using patterns. See Pattern Matching and Replacement for more on related 
functions and Regular Expressions for how to build a pattern. 


Section 115.1: Rearrange character strings using capture 
groups 


If you want to change the order of the parts of a character string, you can use parentheses in the pattern to group parts of the
string together. These groups can then be addressed in the replacement argument using consecutive numbers (\\1, \\2, ...).


The following example shows how you can reorder a vector of names of the form "surname, forename" into a 
vector of the form "forename surname". 


library(randomNames)
set.seed(1)

strings <- randomNames(5)
strings
# [1] "Sigg, Zachary"        "Holt, Jake"           "Ortega, Sandra"       "De La Torre, Nichole"
# [5] "Perkins, Donovon"

sub("^(.+),\\s(.+)$", "\\2 \\1", strings)
# [1] "Zachary Sigg"        "Jake Holt"           "Sandra Ortega"       "Nichole De La Torre"
# [5] "Donovon Perkins"


If you only need the surname, you can address just the first pair of parentheses.


sub("^(.+),\\s(.+)", "\\1", strings)
# [1] "Sigg"        "Holt"        "Ortega"      "De La Torre" "Perkins"
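
If the randomNames package is not installed, the same technique can be tried on a hand-written vector. This is a small sketch; the two names are simply copied from the output shown above:

```r
strings <- c("Sigg, Zachary", "De La Torre, Nichole")

# \\1 is the surname (before the comma), \\2 the forename (after it)
reordered <- sub("^(.+),\\s(.+)$", "\\2 \\1", strings)
surnames  <- sub("^(.+),\\s(.+)$", "\\1", strings)

reordered
surnames
```

Because (.+) is greedy, a multi-word surname such as "De La Torre" is captured whole, up to the comma.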


Section 115.2: Eliminate duplicated consecutive elements 


Let's say we want to eliminate consecutive duplicated elements from a string (there can be more than one repetition). For
example:


2,14,14,14,19 

and convert it into: 

2,14,19 

Using gsub, we can achieve it:


gsub("(\\d+)(,\\1)+", "\\1", "2,14,14,14,19")
[1] "2,14,19"


It also works when more than one element is repeated, for example:

> gsub("(\\d+)(,\\1)+", "\\1", "2,14,14,14,19,19,20,21")
[1] "2,14,19,20,21"




Let's explain the regular expression: 


1. (\\d+): Group 1, delimited by (), matches one or more digits. Remember that we need the double
backslash (\\) here because, in an R string literal, the backslash is itself the escape character (as in \" or \'),
so a literal backslash must be written as \\. \d is equivalent to [0-9].

2. ,: A punctuation sign: the comma (we could use a space or any other delimiter instead).

3. \\1: A string identical to group 1, i.e. the repeated number. If that does not occur, the pattern
does not match.
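
The effect of the doubled backslash can be checked directly. The sketch below prints the pattern as the regular expression engine receives it and applies it to a short, made-up input:

```r
pattern <- "(\\d+)(,\\1)+"

# cat shows the actual characters in the string: (\d+)(,\1)+
cat(pattern, "\n")

# "\\d" in source code is two characters: one backslash and the letter d
n_chars <- nchar("\\d")

# One repeated run collapses to a single element
deduped <- gsub(pattern, "\\1", "7,7,7,8")
deduped
```

The same doubling applies to the replacement string, which is why the back-reference is written "\\1" there as well.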
Let's try a similar situation: eliminating consecutive repeated words:


one,two,two,three,four,four,five,six


Just replace \d with \w, where \w matches any word character: any letter, digit or underscore. It is
equivalent to [a-zA-Z0-9_]:


> gsub("(\\w+)(,\\1)+", "\\1", "one,two,two,three,four,four,five,six")
[1] "one,two,three,four,five,six"


The above pattern therefore covers the duplicated-digits case as a particular case.
Chapter 116: Non-standard evaluation and
standard evaluation


dplyr and many modern libraries in R use non-standard evaluation (NSE) for interactive programming and standard
evaluation (SE) for programming.


For instance, the summarise() function uses non-standard evaluation but relies on summarise_(), which uses
standard evaluation.


The lazyeval library makes it easy to turn standard evaluation functions into NSE functions.


Section 116.1: Examples with standard dplyr verbs 


NSE functions should be used in interactive programming. However, when developing new functions in a new
package, it's better to use the SE versions.


Load dplyr and lazyeval : 


library(dplyr) 
library(lazyeval) 


Filtering 
NSE version 


filter(mtcars, cyl == 8) 
filter(mtcars, cyl < 6) 
filter(mtcars, cyl < 6 & vs == 1) 


SE version (to be used when programming functions in a new package)


filter_(mtcars, .dots = list(~ cyl == 8)) 
filter_(mtcars, .dots = list(~ cyl < 6)) 
filter_(mtcars, .dots = list(~ cyl < 6, ~ vs == 1)) 


Summarise 
NSE version 


summarise(mtcars, mean(disp))
summarise(mtcars, mean_disp = mean(disp))


SE version


summarise_(mtcars, .dots = lazyeval::interp(~ mean(x), x = quote(disp)))

summarise_(mtcars, .dots = setNames(list(lazyeval::interp(~ mean(x), x = quote(disp))),
                                    "mean_disp"))

summarise_(mtcars, .dots = list("mean_disp" = lazyeval::interp(~ mean(x), x = quote(disp))))


Mutate
NSE version

mutate(mtcars, displ_l = disp / 61.0237)

SE version


mutate_(
    .data = mtcars,
    .dots = list(
        "displ_l" = lazyeval::interp(
            ~ x / 61.0237, x = quote(disp)
        )
    )
)
Chapter 117: Randomization


The R language is commonly used for statistical analysis. As such, it contains a robust set of options for 
randomization. For specific information on sampling from probability distributions, see the documentation for 
distribution functions. 


Section 117.1: Random draws and permutations 


The sample command can be used to simulate classic probability problems like drawing from an urn with and 
without replacement, or creating random permutations. 


Note that throughout this example, set.seed is used to ensure that the example code is reproducible. However,
sample will work without explicitly calling set.seed.


Random permutation 


In the simplest form, sample creates a random permutation of a vector of integers. This can be accomplished with: 


set.seed(1251)
sample(x = 10)


(output: the integers 1 to 10 in a random order)


When given no other arguments, sample returns a random permutation of the vector from 1 to x. This can be useful 
when trying to randomize the order of the rows in a data frame. This is a common task when creating 
randomization tables for trials, or when selecting a random subset of rows for analysis. 


library(datasets) 
set.seed(1171) 
iris_rand <- iris[sample(x = 1:nrow(iris)), ] 


> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa


> head(iris_rand)
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
145          6.7         3.3          5.7         2.5  virginica
5            5.0         3.6          1.4         0.2     setosa
85           5.4         3.0          4.5         1.5 versicolor
137          6.3         3.4          5.6         2.4  virginica
128          6.1         3.0          4.9         1.8  virginica
105          6.5         3.0          5.8         2.2  virginica


Draws without Replacement 


Using sample, we can also simulate drawing from a set with and without replacement. To sample without 
replacement (the default), you must provide sample with a set to be drawn from and the number of draws. The set 
to be drawn from is given as a vector. 


set.seed(7043) 
sample(x = LETTERS, size = 7) 
(output: seven distinct upper-case letters in a random order)


Note that if the argument to size is the same as the length of the argument to x, you are creating a random 
permutation. Also note that you cannot specify a size greater than the length of x when doing sampling without 
replacement. 


set.seed(7305)
sample(x = letters, size = 26)


(output: all 26 lower-case letters in a random order)


sample(x = letters, size = 30)
Error in sample.int(length(x), size, replace, prob) :
  cannot take a sample larger than the population when 'replace = FALSE'


This brings us to drawing with replacement. 
Draws with Replacement 


To make random draws from a set with replacement, you use the replace argument to sample. By default, replace 
is FALSE. Setting it to TRUE means that each element of the set being drawn from may appear more than once in the 
final result. 


set.seed(5062)
sample(x = c("A", "B", "C", "D"), size = 8, replace = TRUE)


(output: eight letters drawn from "A" to "D", with repeats)

Changing Draw Probabilities 


By default, when you use sample, it assumes that the probability of picking each element is the same. Consider it as 
a basic "urn" problem. The code below is equivalent to drawing a colored marble out of an urn 20 times, writing 
down the color, and then putting the marble back in the urn. The urn contains one red, one blue, and one green 
marble, meaning that the probability of drawing each color is 1/3. 


set.seed(6472)
sample(x = c("Red", "Blue", "Green"),
       size = 20,
       replace = TRUE)


Suppose that, instead, we wanted to perform the same task, but our urn contains 2 red marbles, 1 blue marble, and 
1 green marble. One option would be to change the argument we send to x to add an additional Red. However, a 
better choice is to use the prob argument to sample. 


The prob argument accepts a vector with the probability of drawing each element. In our example above, the 
probability of drawing a red marble would be 1/2, while the probability of drawing a blue or a green marble would 
be 1/4. 


set.seed(28432)
sample(x = c("Red", "Blue", "Green"),
       size = 20,
       replace = TRUE,
       prob = c(0.50, 0.25, 0.25))


Counter-intuitively, the values given to prob do not need to sum to 1. R always rescales the given
values into probabilities that total 1. For instance, consider our above example of 2 Red, 1 Blue, and 1 Green.
You can achieve the same results as our previous code using those numbers:


set.seed(28432)
frac_prob_example <- sample(x = c("Red", "Blue", "Green"),
                            size = 200,
                            replace = TRUE,
                            prob = c(0.50, 0.25, 0.25))


set.seed(28432)
numeric_prob_example <- sample(x = c("Red", "Blue", "Green"),
                               size = 200,
                               replace = TRUE,
                               prob = c(2, 1, 1))


> identical(frac_prob_example, numeric_prob_example)
[1] TRUE


The major restriction is that you cannot set all the probabilities to be zero, and none of them can be less than zero. 
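
Both points can be checked with a short sketch. The division below reproduces the rescaling by hand (it illustrates the arithmetic and is not R's internal code), and tryCatch is used so that the invalid calls do not stop the script:

```r
weights <- c(Red = 2, Blue = 1, Green = 1)

# Rescaling weights so that they total 1
probs <- weights / sum(weights)
probs

# All-zero weights are rejected with an error
zero_result <- tryCatch(
  sample(x = names(weights), size = 1, prob = c(0, 0, 0)),
  error = function(e) "error"
)

# Negative weights are rejected as well
negative_result <- tryCatch(
  sample(x = names(weights), size = 1, prob = c(-1, 1, 1)),
  error = function(e) "error"
)
```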


You can also utilize prob when replace is set to FALSE. In that situation, after each element is drawn, the 
proportions of the prob values for the remaining elements give the probability for the next draw. In this situation, 
you must have enough non-zero probabilities to reach the size of the sample you are drawing. For example: 


set.seed(21741)
sample(x = c("Red", "Blue", "Green"),
       size = 2,
       replace = FALSE,
       prob = c(0.8, 0.19, 0.01))


In this example, Red is drawn in the first draw (as the first element). There was an 80% chance of Red being drawn, 
a 19% chance of Blue being drawn, and a 1% chance of Green being drawn. 


For the next draw, Red is no longer in the urn. The total of the probabilities among the remaining items is 20% (19% 
for Blue and 1% for Green). For that draw, there is a 95% chance the item will be Blue (19/20) and a 5% chance it will 
be Green (1/20). 
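
The arithmetic for the second draw can be reproduced in a few lines. This sketch only recomputes the proportions described above; it does not call sample:

```r
prob <- c(Red = 0.80, Blue = 0.19, Green = 0.01)

# Red has been drawn, so only Blue and Green remain in the urn
remaining <- prob[names(prob) != "Red"]

# Rescale the remaining weights so that they total 1
second_draw <- remaining / sum(remaining)
second_draw
```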


Section 117.2: Setting the seed 


The set.seed function is used to set the random seed for all randomization functions. If you are using R to create a
randomization that you want to be able to reproduce, you should call set.seed first.


set.seed(1643)
samp1 <- sample(x = 1:5, size = 200, replace = TRUE)


set.seed(1643)
samp2 <- sample(x = 1:5, size = 200, replace = TRUE)


> identical(x = samp1, y = samp2)
[1] TRUE


Note that parallel processing requires special treatment of the random seed, described more elsewhere. 
Chapter 118: Object-Oriented
Programming in R
This documentation page describes the four object systems in R and their high-level similarities and differences. 


Greater detail on each individual system can be found on its own topic page. 


The four systems are: S3, S4, Reference Classes, and R6.


Section 118.1: S3 


The S3 object system is a very simple OO system in R.
Every object has an S3 class. It can be retrieved with the function class.


> class(3) 
[1] "numeric" 


It can also be set with the function class: 


> bicycle <- 2
> class(bicycle) <- 'vehicle'
> class(bicycle)
[1] "vehicle"


It can also be set with the function attr:


> velocipede <- 2
> attr(velocipede, 'class') <- 'vehicle'
> class(velocipede)
[1] "vehicle"


An object can have many classes: 


> class(x = bicycle) <- c('human-powered vehicle', class(x = bicycle))
> class(x = bicycle)
[1] "human-powered vehicle" "vehicle"


When a generic function is called, R uses the method for the first element of the class vector that has one available.


For example, with an object my_bike that has the classes 'bicycle' and 'vehicle':

> my_bike <- structure(2, class = c('bicycle', 'vehicle'))
> summary.vehicle <- function(object, ...) {
+   message('this is a vehicle')
+ }
> summary(object = my_bike)
this is a vehicle


But if we now define a summary.bicycle:


> summary.bicycle <- function(object, ...) {
+   message('this is a bicycle')
+ }
> summary(object = my_bike)
this is a bicycle
Chapter 119: Coercion


Coercion happens in R when the type of an object is changed during computation, either implicitly or by using
functions for explicit coercion (such as as.numeric, as.data.frame, etc.).


Section 119.1: Implicit Coercion 


Coercion happens with data types in R, often implicitly, so that the data can accommodate all the values. For 
example, 


x = 1:3
typeof(x)
#[1] "integer"

x[2] = "hi"
x
#[1] "1"  "hi" "3"
typeof(x)
#[1] "character"


Notice that at first, x is of type integer. But when we assigned x[2] = "hi", all the elements of x were coerced to
character, as vectors in R can only hold data of a single type.
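
The same rule applies when values of different types are combined with c(): the result takes the most general of the types involved, in the order logical, integer, double, character. A short sketch:

```r
t1 <- typeof(c(TRUE, 2L))     # logical combined with integer
t2 <- typeof(c(2L, 3.5))      # integer combined with double
t3 <- typeof(c(3.5, "a"))     # double combined with character

# Everything is converted to the most general type present
mixed <- c(TRUE, 2L, 3.5, "a")

t1; t2; t3
mixed
```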
Chapter 120: Standardize analyses by
writing standalone R scripts


If you want to routinely apply an R analysis to a lot of separate data files, or provide a repeatable analysis method 
to other people, an executable R script is a user-friendly way to do so. Instead of you or your user having to call R 
and execute your script inside R via source(.) or a function call, your user may simply call the script itself as if it 
was a program. 


Section 120.1: The basic structure of standalone R program 
and how to call it 


The first standalone R script 


Standalone R scripts are not executed by the program R (R.exe under Windows), but by a program called Rscript
(Rscript.exe), which is included in your R installation by default.


To hint at this fact, standalone R scripts start with a special line called the Shebang line, which holds the following
content: #!/usr/bin/env Rscript. Under Windows, an additional measure is needed, which is detailed later.


The following simple standalone R script saves a histogram under the file name "hist.png" from numbers it receives 
as input: 


#!/usr/bin/env Rscript 


# User message (\n = end the line) 

cat("Input numbers, separated by space:\n") 

# Read user input as one string (n=1 -> Read only one line) 
input <- readLines(file('stdin'), n=1) 

# Split the string at each space (\\s == any space) 

input <- strsplit(input, "\\s")[[1]] 

# convert the obtained vector of strings to numbers 

input <- as.numeric(input) 


# Open the output picture file 
png("hist.png",width=400, height=300) 
# Draw the histogram 

hist(input)

# Close the output file

dev.off()


You can see several key elements of a standalone R script. The first line is the Shebang line. After
that, cat("....\n") is used to print a message to the user. Use file("stdin") whenever you want to specify "user
input on the console" as a data source. It can be used instead of a file name in several data reading functions (scan,
read.table, read.csv, ...). After the user input is converted from strings to numbers, the plotting begins. Plotting
commands whose output is meant to be written to a file must be enclosed between two commands,
in this case png(.) and dev.off(). The first function depends on the desired output file format (other
common choices being jpeg(.) and pdf(.)). The second function, dev.off(), is always required. It writes the plot
to the file and ends the plotting process.
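
The open-plot-close sequence can be tried without user input. The sketch below uses pdf(.) instead of png(.) and writes to a temporary file; the file location and the numbers are assumptions made only for this illustration:

```r
# Temporary output file (hypothetical location chosen by tempfile)
out <- tempfile(fileext = ".pdf")

# Open the output file; size is given in inches for pdf()
pdf(out, width = 4, height = 3)

# Draw the histogram from some made-up numbers
hist(c(1, 2, 2, 3, 3, 3))

# Close the device; only now is the file complete
invisible(dev.off())

file.exists(out)
```

If dev.off() is left out, the file may be empty or unreadable because the device has not been closed.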


Preparing a standalone R script 
Linux/Mac 


The standalone script's file must first be made executable. This can be done by right-clicking the file, opening
"Properties" in the context menu and checking the "Executable" checkbox in the "Permissions" tab. Alternatively,
the command

chmod +x PATH/TO/SCRIPT/SCRIPTNAME.R


can be called in a Terminal.
Windows 
For each standalone script, a batch file must be written with the following contents: 


"C:\Program Files\R-XXXXXXX\bin\Rscript.exe" "%~dpO@\XXXXXXX.R" %* 


A batch file is a normal text file, but with a *.bat extension instead of a *.txt extension. Create it using a text
editor like Notepad (not Word) or similar, and put the file name in quotation marks ("FILENAME.bat") in the save
dialog. To edit an existing batch file, right-click on it and select "Edit".


You have to adapt the code shown above everywhere XXX... is written: 


• Insert the correct folder where your R installation resides
• Insert the correct name of your script and place it into the same directory as this batch file.


Explanation of the elements in the code: The first part "C:\...\Rscript.exe" tells Windows where to find the
Rscript.exe program. The second part "%~dp0\XXX.R" tells Rscript to execute the R script you've written, which
resides in the same folder as the batch file (%~dp0 stands for the batch file folder). Finally, %* forwards any
command line arguments you give to the batch file to the R script.


If you double-click on the batch file, the R script is executed. If you drag files onto the batch file, the
corresponding file names are given to the R script as command line arguments.
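
Inside the R script, the forwarded arguments can be read with commandArgs(). The following is a minimal sketch of
how a script might list them; the messages are illustrative only.

```r
# Arguments that follow the script name on the command line
# (for the batch file above, these would be the dragged file names)
args <- commandArgs(trailingOnly = TRUE)

if (length(args) == 0) {
  cat("No arguments were passed to the script.\n")
} else {
  for (i in seq_along(args)) {
    cat("Argument", i, ":", args[i], "\n")
  }
}
```

With trailingOnly = TRUE, the options consumed by R itself are left out and only the user-supplied arguments
remain.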


Section 120.2: Using littler to execute R scripts 


littler (pronounced little r) (CRAN) provides, among other features, two ways to run R scripts from the
command line with littler's r command (when working on Linux or macOS).


Installing littler 
From R: 


install.packages("littler") 

The path of r is printed in the terminal, like

You could link to the 'r' binary installed in
'/home/*USER*/R/x86_64-pc-linux-gnu-library/3.4/littler/bin/r'
from '/usr/local/bin' in order to use 'r' for scripting.

To be able to call r from the system's command line, a symlink is needed:

ln -s /home/*USER*/R/x86_64-pc-linux-gnu-library/3.4/littler/bin/r /usr/local/bin/r

Using apt-get (Debian, Ubuntu):

sudo apt-get install littler


Using littler with standard .r scripts 


With r from littler it is possible to execute standalone R scripts without any changes to the script. Example script:

# User message (\n = end the line)
cat("Input numbers, separated by space:\n")

# Read user input as one string (n=1 -> Read only one line)
input <- readLines(file('stdin'), n=1)

# Split the string at each space (\\s == any space)
input <- strsplit(input, "\\s")[[1]]

# convert the obtained vector of strings to numbers
input <- as.numeric(input)

# Open the output picture file
png("hist.png",width=400, height=300)

# Draw the histogram
hist(input)

# Close the output file
dev.off()


Note that there is no shebang at the top of the script. When saved as, for example, hist.r, it is directly callable
from the system command line:


r hist.r 


Using littler on shebanged scripts 
It is also possible to create executable R scripts with littler, with the use of the shebang 


#!/usr/bin/env r 


at the top of the script. The corresponding R script has to be made executable with chmod +x /path/to/script.r
and is then directly callable from the system terminal.

Chapter 121: Analyze tweets with R




Section 121.1: Download Tweets 


The first thing you need to do is download tweets. For this you need to set up your Twitter account. Plenty of
information on how to do this can be found on the Internet. In particular, I found two links useful for my setup
(last checked in May 2017).


R Libraries 


You will need the following R packages 


library("devtools") 
library("twitteR") 
library("ROAuth") 


Supposing you have your keys, run the following code:


api_key <- "XXXXXXXXXXXXXXXXXXXXXX"
api_secret <- "XXXXXXXXXXXXXXXXXXXXXX"
access_token <- "XXXXXXXXXXXXXXXXXXXXXX"
access_token_secret <- "XXXXXXXXXXXXXXXXXXXXXX"

# pass all four keys so that no interactive authorization is needed
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)


Replace XXXXXXXXXXXXXXXXXXXXXX with your own keys (if you have set up your Twitter account, you know which keys I
mean).


Let's now suppose we want to download tweets on coffee. The following code will do it:


search.string <- "#coffee"
no.of.tweets <- 1000

c_tweets <- searchTwitter(search.string, n=no.of.tweets, lang="en")

You will get up to 1000 tweets tagged "#coffee".


Section 121.2: Get text of tweets 


Now we need to access the text of the tweets. We extract it with the sapply function, and we also strip special
characters that we don't need for now, such as emoticons:


coffee_tweets <- sapply(c_tweets, function(t) t$getText())
coffee_tweets <- sapply(coffee_tweets, function(row) iconv(row, "latin1", "ASCII", sub=""))


You can check your tweets with the head function:


head(coffee_tweets)

Chapter 122: Natural language processing


Natural language processing (NLP) is the field of computer science focused on retrieving information from textual
input generated by human beings.


Section 122.1: Create a term frequency matrix 


The simplest approach to the problem (and the most commonly used so far) is to split sentences into tokens.
Simplifying, words have abstract and subjective meanings to the people using and receiving them, whereas tokens
have an objective interpretation: an ordered sequence of characters (or bytes). Once sentences are split, the order
of the tokens is disregarded. This approach to the problem is known as the bag-of-words model.
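
As a rough illustration of the bag-of-words idea, the counting can be sketched in base R alone. This is a
simplified sketch and not the tm workflow used in this section: it splits at whitespace only and applies no
stemming or stop-word removal, and the object names (docs, tokens, vocab, tf) are chosen just for this example.

```r
# Four toy documents (the same strings as in the tm example of this section)
docs <- c("drugs hospitals doctors",
          "smog pollution environment",
          "doctors hospitals healthcare",
          "pollution environment water")

# 1. Tokenize: split each document at whitespace
tokens <- strsplit(tolower(docs), "\\s+")

# 2. Vocabulary: every distinct token, sorted
vocab <- sort(unique(unlist(tokens)))

# 3. Count each vocabulary term per document; the order of tokens is ignored
tf <- vapply(tokens,
             function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
             integer(length(vocab)))
rownames(tf) <- vocab
colnames(tf) <- paste0("doc", seq_along(docs))

print(tf)
```

Each column of tf is one document and each row one token, so two documents with the same words in a different
order would produce identical columns.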


A term frequency is a dictionary in which a weight is assigned to each token. In the first example, we construct a
term frequency matrix from a corpus (a collection of documents) with the R package tm.


require(tm)
doc1 <- "drugs hospitals doctors"
doc2 <- "smog pollution environment"
doc3 <- "doctors hospitals healthcare"
doc4 <- "pollution environment water"
corpus <- c(doc1, doc2, doc3, doc4)
tm_corpus <- Corpus(VectorSource(corpus))


In this example, we created a corpus of class Corpus, defined by the package tm, with two functions: Corpus and
VectorSource, which returns a VectorSource object from a character vector. The object tm_corpus is a list of our
documents with additional (and optional) metadata to describe each document.


str(tm_corpus)
List of 4
 $ 1:List of 2
  ..$ content: chr "drugs hospitals doctors"
  ..$ meta   :List of 7
  .. ..$ author       : chr(0)
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2017-06-03 00:31:34"
  .. ..$ description  : chr(0)
  .. ..$ heading      : chr(0)
  .. ..$ id           : chr "1"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0)
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
[truncated]


Once we have a Corpus, we can preprocess the tokens it contains to improve the quality of the final output (the
term frequency matrix). To do this we use the tm function tm_map, which, similarly to the apply family of
functions, transforms the documents in the corpus by applying a function to each document.


tm_corpus <- tm_map(tm_corpus, tolower)
tm_corpus <- tm_map(tm_corpus, removeWords, stopwords("english"))
tm_corpus <- tm_map(tm_corpus, removeNumbers)
tm_corpus <- tm_map(tm_corpus, PlainTextDocument)
tm_corpus <- tm_map(tm_corpus, stemDocument, language="english")
tm_corpus <- tm_map(tm_corpus, stripWhitespace)
tm_corpus <- tm_map(tm_corpus, PlainTextDocument)


Following these transformations, we finally create the term frequency matrix with

tdm <- TermDocumentMatrix(tm_corpus)


which gives a

<<TermDocumentMatrix (terms: 8, documents: 4)>>
Non-/sparse entries: 12/20
Sparsity           : 62%
Maximal term length: 9
Weighting          : term frequency (tf)


that we can view by transforming it to a matrix

as.matrix(tdm)

           Docs
Terms       character(0) character(0) character(0) character(0)
  doctor               1            0            1            0
  drug                 1            0            0            0
  environ              0            1            0            1
  healthcar            0            0            1            0
  hospit               1            0            1            0
  pollut               0            1            0            1
  smog                 0            1            0            0
  water                0            0            0            1


Each row represents the frequency of one token in each document (4 documents, 4 columns). As you may have noticed,
the tokens have been stemmed (e.g. environment to environ).


In the previous lines, we have weighted each token/document pair with the absolute frequency (i.e. the number of
instances of the token that appear in the document).

Chapter 123: R Markdown Notebooks (from RStudio)


An R Notebook is an R Markdown document with chunks that can be executed independently and interactively,
with output visible immediately beneath the input. They are similar to R Markdown documents, except that results
are displayed in the R Notebook creation/edit mode rather than in the rendered output. Note: R Notebooks are a new
feature of RStudio and are only available in version 1.0 or higher of RStudio.


Section 123.1: Creating a Notebook 


You can create a new notebook in RStudio with the menu command File -> New File -> R Notebook.
If you don't see the option for R Notebook, then you need to update your version of RStudio. For installation of
RStudio follow this guide.


[Screenshot: RStudio File menu with New File -> R Notebook]


Section 123.2: Inserting Chunks 


Chunks are pieces of code that can be executed interactively. To insert a new chunk, click the Insert button on the
notebook toolbar and select your desired code platform (R in this case, since we want to write R code).
Alternatively, we can use the keyboard shortcut Ctrl + Alt + I (OS X: Cmd + Option + I) to insert a new chunk.


[Screenshot: RStudio notebook editor showing the Insert button on the toolbar]


Section 123.3: Executing Chunk Code 


You can run the current chunk by clicking Run Current Chunk (green play button) on the right side of the chunk.
Alternatively, we can use the keyboard shortcut Ctrl + Shift + Enter (OS X: Cmd + Shift + Enter).


The output from all the lines in the chunk will appear beneath the chunk.

Splitting Code into Chunks


Since a chunk produces its output beneath the chunk, when a single chunk contains multiple lines of code that
produce multiple outputs, it is often helpful to split it into multiple chunks such that each chunk produces one
output.


To do this, select the code you want to split into a new chunk and press Ctrl + Alt + I (OS X: Cmd + Option + I).


[Screenshot: RStudio notebook editor]


Section 123.4: Execution Progress 


When you execute code in a notebook, an indicator will appear in the gutter to show you execution progress. Lines 
of code which have been sent to R are marked with dark green; lines which have not yet been sent to R are marked 
with light green. 


Executing Multiple Chunks 


Running or re-running chunks one at a time by pressing Run for every chunk in a document can be painful.
We can use Run All from the Run menu on the toolbar to run all the chunks present in the notebook. The keyboard
shortcut is Ctrl + Alt + R (OS X: Cmd + Option + R).


There's also a Restart R and Run All Chunks command (available in the Run menu on the editor toolbar),
which gives you a fresh R session prior to running all the chunks.


We also have options like Run All Chunks Above and Run All Chunks Below to run the chunks above or below a
selected chunk.


[Screenshot: RStudio notebook with several chunks (iris data, SVM model, prediction) and the Run menu]


Before rendering the final version of a notebook we can preview the output. Click on the Preview button on the
toolbar and select the desired output format.

You can change the type of output by setting the output option to "pdf_document" or "html_notebook".


[Screenshot: RStudio notebook with the output option set to html_notebook]


When a notebook .Rmd is saved, an .nb.html file is created alongside it. This file is a self-contained HTML file which
contains both a rendered copy of the notebook with all current chunk outputs (suitable for display on a website)
and a copy of the notebook .Rmd itself.


More info can be found at RStudio docs.

Chapter 124: Aggregating data frames


Aggregation is one of the most common uses for R. There are several ways to do so in R, which we will illustrate 
here. 


Section 124.1: Aggregating with data.table 


Grouping with the data.table package is done using the syntax dt[i, j, by], which can be read out loud as: "Take
dt, subset rows using i, then calculate j, grouped by by." Within the dt statement, multiple calculations or groups
should be put in a list. Since .() is an alias for list(), both can be used interchangeably. In the examples below
we use .().


CODE: 


# Aggregating with data.table
library(data.table)

dt = data.table(group=c("Group 1","Group 1","Group 2","Group 2","Group 2"),
                subgroup = c("A","A","A","A","B"), value = c(2,2.5,1,2,1.5))
print(dt)

# sum, grouping by one column
dt[, .(value=sum(value)), group]

# mean, grouping by one column
dt[, .(value=mean(value)), group]

# sum, grouping by multiple columns
dt[, .(value=sum(value)), .(group,subgroup)]

# custom function, grouping by one column
# in this example we want the sum of all values larger than 2 per group.
dt[, .(value=sum(value[value>2])), group]


OUTPUT:


> # Aggregating with data.table
> library(data.table)
>
> dt = data.table(group=c("Group 1","Group 1","Group 2","Group 2","Group 2"),
+                 subgroup = c("A","A","A","A","B"), value = c(2,2.5,1,2,1.5))
> print(dt)
     group subgroup value
1: Group 1        A   2.0
2: Group 1        A   2.5
3: Group 2        A   1.0
4: Group 2        A   2.0
5: Group 2        B   1.5
>
> # sum, grouping by one column
> dt[, .(value=sum(value)), group]
     group value
1: Group 1   4.5
2: Group 2   4.5
>
> # mean, grouping by one column
> dt[, .(value=mean(value)), group]
     group value
1: Group 1  2.25
2: Group 2  1.50
>
> # sum, grouping by multiple columns
> dt[, .(value=sum(value)), .(group,subgroup)]
     group subgroup value
1: Group 1        A   4.5
2: Group 2        A   3.0
3: Group 2        B   1.5
>
> # custom function, grouping by one column
> # in this example we want the sum of all values larger than 2 per group.
> dt[, .(value=sum(value[value>2])), group]
     group value
1: Group 1   2.5
2: Group 2   0.0
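
The examples above leave i empty. As a small sketch of all three parts of dt[i, j, by] used together, the call
below first subsets the rows, then computes two summaries per group. The column names total and n are chosen for
illustration, and the requireNamespace() guard only skips the example when data.table is not installed.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  dt <- data.table(group    = c("Group 1", "Group 1", "Group 2", "Group 2", "Group 2"),
                   subgroup = c("A", "A", "A", "A", "B"),
                   value    = c(2, 2.5, 1, 2, 1.5))

  # i:  keep only rows with value > 1
  # j:  sum of the remaining values and number of remaining rows (.N)
  # by: one result row per group
  res <- dt[value > 1, .(total = sum(value), n = .N), by = group]
  print(res)
}
```

Because the filter in i is applied before grouping, the row with value 1 in Group 2 contributes to neither the
sum nor the count.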


Section 124.2: Aggregating with base R 
For this, we will use the function aggregate, which can be used as follows: 


aggregate(formula, function, data) 


The following code shows various ways of using the aggregate function. 


CODE: 


df = data.frame(group=c("Group 1","Group 1","Group 2","Group 2","Group 2"),
                subgroup = c("A","A","A","A","B"), value = c(2,2.5,1,2,1.5))
print(df)

# sum, grouping by one column
aggregate(value~group, FUN=sum, data=df)

# mean, grouping by one column
aggregate(value~group, FUN=mean, data=df)

# sum, grouping by multiple columns
aggregate(value~group+subgroup, FUN=sum, data=df)

# custom function, grouping by one column
# in this example we want the sum of all values larger than 2 per group.
aggregate(value~group, FUN=function(x) sum(x[x>2]), data=df)


OUTPUT:


> df = data.frame(group=c("Group 1","Group 1","Group 2","Group 2","Group 2"),
+                 subgroup = c("A","A","A","A","B"), value = c(2,2.5,1,2,1.5))
> print(df)
    group subgroup value
1 Group 1        A   2.0
2 Group 1        A   2.5
3 Group 2        A   1.0
4 Group 2        A   2.0
5 Group 2        B   1.5
>
> # sum, grouping by one column
> aggregate(value~group, FUN=sum, data=df)
    group value
1 Group 1   4.5
2 Group 2   4.5
>
> # mean, grouping by one column
> aggregate(value~group, FUN=mean, data=df)
    group value
1 Group 1  2.25
2 Group 2  1.50
>
> # sum, grouping by multiple columns
> aggregate(value~group+subgroup, FUN=sum, data=df)
    group subgroup value
1 Group 1        A   4.5
2 Group 2        A   3.0
3 Group 2        B   1.5
>
> # custom function, grouping by one column
> # in this example we want the sum of all values larger than 2 per group.
> aggregate(value~group, FUN=function(x) sum(x[x>2]), data=df)
    group value
1 Group 1   2.5
2 Group 2   0.0
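
As a cross-check of the base R results, the same per-group sums can be computed with tapply(), and aggregate() can
return several statistics at once when FUN returns a named vector. This is a minimal sketch; the object names
(sums_formula, sums_tapply, stats) are chosen for this example only.

```r
df <- data.frame(group    = c("Group 1", "Group 1", "Group 2", "Group 2", "Group 2"),
                 subgroup = c("A", "A", "A", "A", "B"),
                 value    = c(2, 2.5, 1, 2, 1.5))

# Formula interface: one row per group
sums_formula <- aggregate(value ~ group, data = df, FUN = sum)

# tapply: the same sums as a named vector
sums_tapply <- tapply(df$value, df$group, sum)

# Several statistics at once: FUN returns a named vector,
# so the value column of the result becomes a matrix with columns sum and mean
stats <- aggregate(value ~ group, data = df,
                   FUN = function(x) c(sum = sum(x), mean = mean(x)))

print(sums_formula)
print(sums_tapply)
print(stats)
```

aggregate() returns a data frame, which is convenient for further processing, while tapply() returns a plain
named vector that is handy for quick look-ups such as sums_tapply[["Group 1"]].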


Section 124.3: Aggregating with dplyr 


Aggregating with dplyr is easy! You can use the group_by() and the summarize() functions for this. Some examples 
are given below. 


CODE: 


# Aggregating with dplyr
library(dplyr)

df = data.frame(group=c("Group 1","Group 1","Group 2","Group 2","Group 2"),
                subgroup = c("A","A","A","A","B"), value = c(2,2.5,1,2,1.5))
print(df)

# sum, grouping by one column
df %>% group_by(group) %>% summarize(value = sum(value)) %>% as.data.frame()

# mean, grouping by one column
df %>% group_by(group) %>% summarize(value = mean(value)) %>% as.data.frame()

# sum, grouping by multiple columns
df %>% group_by(group,subgroup) %>% summarize(value = sum(value)) %>% as.data.frame()

# custom function, grouping by one column
# in this example we want the sum of all values larger than 2 per group.
df %>% group_by(group) %>% summarize(value = sum(value[value>2])) %>% as.data.frame()


OUTPUT:


> library(dplyr)
>
> df = data.frame(group=c("Group 1","Group 1","Group 2","Group 2","Group 2"),
+                 subgroup = c("A","A","A","A","B"), value = c(2,2.5,1,2,1.5))
> print(df)
    group subgroup value
1 Group 1        A   2.0
2 Group 1        A   2.5
3 Group 2        A   1.0
4 Group 2        A   2.0
5 Group 2        B   1.5
>
> # sum, grouping by one column
> df %>% group_by(group) %>% summarize(value = sum(value)) %>% as.data.frame()
    group value
1 Group 1   4.5
2 Group 2   4.5
>
> # mean, grouping by one column
> df %>% group_by(group) %>% summarize(value = mean(value)) %>% as.data.frame()
    group value
1 Group 1  2.25
2 Group 2  1.50
>
> # sum, grouping by multiple columns
> df %>% group_by(group,subgroup) %>% summarize(value = sum(value)) %>% as.data.frame()
    group subgroup value
1 Group 1        A   4.5
2 Group 2        A   3.0
3 Group 2        B   1.5
>
> # custom function, grouping by one column
> # in this example we want the sum of all values larger than 2 per group.
> df %>% group_by(group) %>% summarize(value = sum(value[value>2])) %>% as.data.frame()
    group value
1 Group 1   2.5
2 Group 2   0.0

Chapter 125: Data acquisition


Get data directly into an R session. One of the nice features of R is the ease of data acquisition. There are
several ways to disseminate data using R packages.


Section 125.1: Built-in datasets 


R has a vast collection of built-in datasets. Usually, they are used for teaching purposes to create quick and
easily reproducible examples. There is a nice web page listing the built-in datasets:


https://vincentarelbundock.github.io/Rdatasets/datasets.html 


Example 


Swiss Fertility and Socioeconomic Indicators (1888) Data. Let's check how fertility differs by rurality and by the
predominance of the Catholic population.


library(tidyverse)


swiss %>%
  ggplot(aes(x = Agriculture, y = Fertility,
             color = Catholic > 50))+
  geom_point()+
  stat_ellipse()


[Figure: scatter plot of Fertility against Agriculture, colored by Catholic > 50 (FALSE/TRUE), with one ellipse per group]
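
The same comparison can also be made numerically with base R alone, since swiss ships with R in the datasets
package. The following is a minimal sketch; the object names mostly_catholic, mean_fertility and dataset_names are
chosen for this example.

```r
# swiss is a built-in data frame, so nothing has to be downloaded
str(swiss)

# Split the provinces by Catholic share and compare mean fertility
mostly_catholic <- swiss$Catholic > 50
mean_fertility  <- tapply(swiss$Fertility, mostly_catholic, mean)
print(mean_fertility)

# List the names of the datasets shipped in the datasets package
dataset_names <- data(package = "datasets")$results[, "Item"]
head(dataset_names)
```

Calling data() without arguments lists the datasets of all currently attached packages.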


Section 125.2: Packages to access open databases 
Numerous packages are created specifically to access particular databases. Using them can save a lot of time on
reading and formatting the data.


Eurostat 


Even though the eurostat package has a function search_eurostat(), it does not find all the relevant datasets
available. Thus, it's more convenient to look up the code of a dataset manually at the Eurostat website: Countries
Database, or Regional Database. If the automated download does not work, the data can be grabbed manually via the
Bulk Download Facility.


library(tidyverse) 
library(lubridate) 
library(forcats) 
library(eurostat) 
library(geofacet) 
library(viridis) 
library(ggthemes) 
library(extrafont) 


# download NEET data for countries
neet <- get_eurostat("edat_lfse_22")


neet %>%
  filter(geo %>% paste %>% nchar == 2,
         sex == "T", age == "Y18-24") %>%
  group_by(geo) %>%
  mutate(avg = values %>% mean()) %>%
  ungroup() %>%
  ggplot(aes(x = time %>% year(),
             y = values))+
  geom_path(aes(group = 1))+
  geom_point(aes(fill = values), pch = 21)+
  scale_x_continuous(breaks = seq(2000, 2015, 5),
                     labels = c("2000", "'05", "'10", "'15"))+
  scale_y_continuous(expand = c(0, 0), limits = c(0, 40))+
  scale_fill_viridis("NEET, %", option = "B")+
  facet_geo(~ geo, grid = "eu_grid1")+
  labs(x = "Year",
       y = "NEET, %",
       title = "Young people neither in employment nor in education and training in Europe",
       subtitle = "Data: Eurostat Regional Database, 2000-2016",
       caption = "ikashnitsky.github.io")+
  theme_few(base_family = "Roboto Condensed", base_size = 15)+
  theme(axis.text = element_text(size = 10),
        panel.spacing.x = unit(1, "lines"),
        legend.position = c(0, 0),
        legend.justification = c(0, 0))


[Figure: one line chart per country arranged on a map-like grid, "Young people neither in employment nor in education and training in Europe" (Data: Eurostat Regional Database, 2000-2016; ikashnitsky.github.io)]


Section 125.3: Packages to access restricted data 


Human Mortality Database 


Human Mortality Database is a project of the Max Planck Institute for Demographic Research that gathers and
pre-processes human mortality data for those countries where more or less reliable statistics are available.


# load required packages 
library(tidyverse) 
library(extrafont) 
library(HMDHFDplus)


country <- getHMDcountries()


exposures <- list()
for (i in 1:length(country)) {
  cnt <- country[i]
  exposures[[cnt]] <- readHMDweb(cnt, "Exposures_1x1", user_hmd, pass_hmd)
  # let's print the progress (print() is needed inside a loop)
  print(paste(i, 'out of', length(country)))
} # this will take quite a lot of time


Please note that the arguments user_hmd and pass_hmd are the login credentials for the website of the Human
Mortality Database. In order to access the data, one needs to create an account at http://www.mortality.org/ and
provide their own credentials to the readHMDweb() function.
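
One way to avoid typing the credentials into a script is to read them from environment variables. This is only a
sketch under an assumption: the variable names HMD_USER and HMD_PASS are hypothetical and are not defined by the
HMDHFDplus package; any names can be used as long as they are set before R starts (for example in .Renviron).

```r
# Hypothetical environment variable names; NA is returned if a variable is not set
user_hmd <- Sys.getenv("HMD_USER", unset = NA)
pass_hmd <- Sys.getenv("HMD_PASS", unset = NA)

have_credentials <- !is.na(user_hmd) && !is.na(pass_hmd)

if (!have_credentials) {
  message("Set HMD_USER and HMD_PASS before calling readHMDweb().")
}
```

Keeping credentials out of the script makes it safer to share the code or put it under version control.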


sr_age <- list() 


for (i in 1:length(exposures)) { 
di <- exposures[[i]] 
sr_agei <- di %>% select(Year,Age,Female,Male) %>% 
filter(Year %in% 2012) %>% 
select(-Year) %>% 
transmute(country = names(exposures)[i], 
age = Age, sr_age = Male / Female * 100) 
sr_age[[i]] <- sr_agei 
} 


sr_age <- bind_rows(sr_age) 


# remove optional populations
sr_age <- sr_age %>% filter(!country %in% c("FRACNP", "DEUTE", "DEUTW", "GBRCENW", "GBR_NP"))


# summarize all ages older than 90 (too jerky)
sr_age_90 <- sr_age %>% filter(age %in% 90:110) %>%
  group_by(country) %>% summarise(sr_age = mean(sr_age, na.rm = T)) %>%
  ungroup() %>% transmute(country, age=90, sr_age)


df_plot <- bind_rows(sr_age %>% filter(!age %in% 90:110), sr_age_90)


# finally - plot
df_plot %>%
  ggplot(aes(age, sr_age, color = country, group = country))+
  geom_hline(yintercept = 100, color = 'grey50', size = 1)+
  geom_line(size = 1)+
  scale_y_continuous(limits = c(0, 120), expand = c(0, 0), breaks = seq(0, 120, 20))+
  scale_x_continuous(limits = c(0, 90), expand = c(0, 0), breaks = seq(0, 80, 20))+
  xlab('Age')+
  ylab('Sex ratio, males per 100 females')+
  facet_wrap(~country, ncol=6)+
  theme_minimal(base_family = "Roboto Condensed", base_size = 15)+
  theme(legend.position='none',
        panel.border = element_rect(size = .5, fill = NA))


Section 125.4: Datasets within packages


There are packages that include data or are created specifically to disseminate datasets. When such a package is
loaded (library(pkg)), the attached datasets either become available as R objects or need to be loaded with the
data() function.
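
Both cases can be sketched with the datasets package that ships with R. The dataset women is used here only as an
example, and loading it into a separate environment e is a choice made for this sketch to keep the workspace clean.

```r
# Case 1: the dataset is visible directly as an R object
head(women)

# Case 2: the dataset is loaded explicitly by name with data();
# here it goes into a separate environment instead of the global workspace
e <- new.env()
data("women", package = "datasets", envir = e)
ls(e)
```

With the default envir argument, data() places the object in the environment it is called from, typically the
global workspace.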


Gapminder 


A nice dataset on the development of countries. 


library(tidyverse) 
library(gapminder)


gapminder %>%
  ggplot(aes(x = year, y = lifeExp,
             color = continent))+
  geom_jitter(size = 1, alpha = .2, width = .75)+
  stat_summary(geom = "path", fun.y = mean, size = 1)+  # with ggplot2 >= 3.3.0 use fun = mean
  theme_minimal()


[Figure: lifeExp against year, jittered points with one mean path per continent (Africa, Americas, Asia, Europe, Oceania)]


World Population Prospects 2015 - United Nations Population Department 


Let's see how the world has converged in male life expectancy at birth over 1950-2015. 


library(tidyverse) 
library(forcats) 
library(wpp2015)
library(ggjoy)
library(viridis)
library(extrafont)

data(UNlocations)


countries <- UNlocations %>%
  filter(location_type == 4) %>%
  transmute(name = name %>% paste()) %>%
  as_vector()


data(e0M)


e0M %>%
  filter(country %in% countries) %>%
  select(-last.observed) %>%
  gather(period, value, 3:15) %>%
  ggplot(aes(x = value, y = period %>% fct_rev()))+
  geom_joy(aes(fill = period))+
  scale_fill_viridis(discrete = T, option = "B", direction = -1,
                     begin = .1, end = .9)+
  labs(x = "Male life expectancy at birth",
       y = "Period",
       title = "The world convergence in male life expectancy at birth since 1950",
       subtitle = "Data: UNPD World Population Prospects 2015 Revision",
       caption = "ikashnitsky.github.io")+
  theme_minimal(base_family = "Roboto Condensed", base_size = 15)+
  theme(legend.position = "none")


[Figure: density ridges by period, "The world convergence in male life expectancy at birth since 1950" (Data: UNPD World Population Prospects 2015 Revision)]

Chapter 126: R memento by examples


This topic is meant to be a memento about the R language without any text, with self-explanatory examples.


Each example is meant to be as succinct as possible.


Section 126.1: Plotting (using plot) 


# Creates a 1 row - 2 columns format 
par(mfrow=c(1,2))

plot(rnorm(100), main = "Graph 1", ylab = "Normal distribution")
grid()
legend(x = 40, y = -1, legend = "A legend")

plot(rnorm(100), main = "Graph 2", type = "l")
abline(v = 50)


Result:

[Figure: two side-by-side plots against Index, "Graph 1" (points with grid and legend) and "Graph 2" (line with a vertical line at 50)]


Section 126.2: Commonly used functions 


# Create 100 standard normals in a vector
x <- rnorm(100, mean = 0, sd = 1)

# Find the length of a vector
length(x)

# Compute the mean
mean(x)

# Compute the standard deviation
sd(x)

# Compute the median value
median(x)

# Compute the range (min, max)
range(x)

# Sum the elements of a vector
sum(x)

# Cumulative sum (x[1], x[1]+x[2], ...)
cumsum(x)

# Display the first 3 elements
head(x, 3)

# Display min, 1st quartile, median, mean, 3rd quartile, max
summary(x)

# Compute successive difference between elements
diff(x)

# Create a range from 1 to 10 step 1
1:10

# Create a range from 1 to 10 step 0.1
seq(1, 10, 0.1)


# Print a string 
print("hello world") 


Section 126.3: Data types 


Vectors 

a <- c(1, 2, 3) 
b <- c(4, 5, 6) 
mean_ab <- (a + b) / 2 


d <- c(1, 0, 1) 
only_1_3 <- a[d == 1] 


Matrices 

mat <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2) 
# No row names; one column name per column 
dimnames(mat) <- list(NULL, c("a", "b")) 
mat[,] == mat 

Dataframes 


df <- data.frame(qualifiers = c("Buy", "Sell", "Sell"), 
                 symbols = c("AAPL", "MSFT", "GOOGL"), 
                 values = c(326.0, 598.3, 201.5)) 

df$symbols == df[[2]] 

df$symbols == df[["symbols"]] 


# Row 1, column 2 
df[[1, 2]] == "AAPL" 

Lists 

l <- list(a = 500, "aaa", 98.2) 
length(l) == 3 
class(l[1]) == "list" 
class(l[[1]]) == "numeric" 
class(l$a) == "numeric" 

Environments 


env <- new.env() 
env[["foo"]] <- "bar" 
# env2 refers to the same environment, it is not a copy 
env2 <- env 
env2[["foo"]] <- "BAR" 


env[["foo"]] == "BAR" 

get("foo", envir = env) == "BAR" 
rm("foo", envir = env) 
# Comparing with == NULL yields logical(0), so test with is.null() 
is.null(env[["foo"]]) 




Chapter 127: Updating R version 


Installing or updating your software gives you access to new features and bug fixes. Updating your R installation can 
be done in a couple of ways. One simple way is to go to the R website and download the latest version for your system. 


Section 127.1: Installing from R Website 


To get the latest release, go to https://cran.r-project.org/ and download the file for your operating system. Open the 
downloaded file and follow the on-screen installation steps. All the settings can be left at their defaults unless you 
want to change a specific behaviour. 


Section 127.2: Updating from within R using installr Package 


You can also update R from within R by using a handy package called installr. 


Open the R console (not RStudio; this does not work from within RStudio) and run the following code to install the 
package and start the update. 


install.packages("installr") 
library("installr") 
updateR() 
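

The restriction above can also be checked in code before the update is started. The following is a minimal sketch, not part of installr: the helper name can_run_updateR is hypothetical, and it assumes that installr targets Windows and that RStudio sets the RSTUDIO environment variable to "1" in its R sessions.

```r
# Hypothetical helper: TRUE only when an in-place update looks feasible.
# Assumptions: installr targets Windows, and RStudio sets RSTUDIO=1.
can_run_updateR <- function() {
  on_windows <- identical(.Platform$OS.type, "windows")
  in_rstudio <- identical(Sys.getenv("RSTUDIO"), "1")
  on_windows && !in_rstudio
}

# updateR() opens dialogs, so only call it from an interactive session.
if (interactive() && can_run_updateR() &&
    requireNamespace("installr", quietly = TRUE)) {
  installr::updateR()
} else {
  message("Run updateR() from the plain R console on Windows, with installr installed.")
}
```

When any of the checks fails, the sketch only prints a message and leaves the installation untouched.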


[Screenshot: RGui console after running library(installr) and updateR(). The installer (about 74.5 MB) is downloaded and a "Select Setup Language" dialog asks for the language to use during the installation.] 


Section 127.3: Deciding on the old packages 


Once the installation is finished, click the Finish button. 


installr then asks whether you want to copy your packages from the older version of R to the newer version. If you 
choose Yes, all the packages are copied to the newer version of R. 
[Screenshot: dialog asking "Do you wish to copy your packages from the older version of R to the newer version of R?" with Yes and No buttons] 
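

If you decline, or if you updated R without installr, a similar result can be approximated by hand. The following is a minimal sketch, not what installr does internally: it records the names of the installed packages before the update and afterwards works out which ones are missing. The file location and the variable names are assumptions chosen for illustration; a real run would use a path that survives across R sessions instead of tempdir().

```r
# Before updating: save the names of all installed packages.
# tempdir() is used only to keep the sketch self-contained.
snapshot_file <- file.path(tempdir(), "installed_packages.rds")
old_pkgs <- rownames(installed.packages())
saveRDS(old_pkgs, snapshot_file)

# After updating: compare the snapshot with the new library.
wanted <- readRDS(snapshot_file)
to_install <- setdiff(wanted, rownames(installed.packages()))

# Reinstalling needs network access, so it is left commented out here.
# install.packages(to_install)
```

Reinstalling from a list of names rebuilds every package for the new R version, which avoids carrying over binaries built for the old one.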


After that, you can choose whether to keep the packages in the old R installation or delete them. 


[Screenshot: dialog asking "Once your packages are copied to the new R, do you wish to KEEP the packages from the library in the OLD R installation? (if you choose 'NO' - you will erase your packages in the old R version)" with Yes and No buttons] 


You can also copy your Rprofile.site from the older version to keep all your customised settings. 
[Screenshot: dialog asking "Do you wish to copy your 'Rprofile.site' from the older version of R to the newer version of R?"] 
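

To see which file this refers to, the default location of the site-wide profile can be built from the home directory of the running R installation. This is a small sketch under the assumption that the R_PROFILE environment variable is not set (if it is, R reads the site profile from that path instead); the variable name site_profile is chosen for illustration.

```r
# Default site-wide profile of the running installation: R_HOME/etc/Rprofile.site
site_profile <- file.path(R.home("etc"), "Rprofile.site")
site_profile

# FALSE simply means this installation has no site profile.
file.exists(site_profile)
```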


Section 127.4: Updating Packages 


You can update your installed packages once the updating of R is done. 


[Screenshot: dialog asking "Do you wish to update your packages in the newly installed R?"] 
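

Packages can also be updated later from the console, without going through the dialog. The sketch below wraps update.packages() from the utils package; the wrapper name update_all_packages and the choice of repository are assumptions made for illustration.

```r
# Hypothetical wrapper around utils::update.packages().
# ask = FALSE updates without prompting for each package;
# checkBuilt = TRUE also treats packages built under an earlier R version as old.
update_all_packages <- function(repos = "https://cran.r-project.org") {
  update.packages(ask = FALSE, checkBuilt = TRUE, repos = repos)
}

# Needs network access, so the call is left commented out here.
# update_all_packages()
```

Setting checkBuilt = TRUE is the relevant choice after a version upgrade, because copied packages were built under the previous R version.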


Once it's done, restart R and enjoy exploring. 


Section 127.5: Check R Version 


You can check the R version from the console. 